# **Video Summarization Project**


This notebook demonstrates how to extract audio from a YouTube video (a local video file works as well), transcribe the audio to text, and summarize the text with Google's Gemini models through the API. It uses `pytube`/`yt-dlp` for downloading videos, `ffmpeg` for audio extraction, the Whisper speech-recognition model loaded through Hugging Face's `transformers`, and Gemini for text summarization and analysis.


## **Step 1: Download YouTube Video**

In [None]:
# Install yt-dlp for downloading YouTube videos (pytube is the commented-out alternative)

# !pip install pytube

!pip install yt-dlp

Collecting yt-dlp
  Downloading yt_dlp-2024.7.16-py3-none-any.whl (3.1 MB)
Collecting brotli (from yt-dlp)
  Downloading Brotli-1.1.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.0 MB)
Collecting mutagen (from yt-dlp)
  Downloading mutagen-1.47.0-py3-none-any.whl (194 kB)
Collecting pycryptodomex (from yt-dlp)
  Downloading pycryptodomex-3.20.0-cp35-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)
Collecting requests<3,>=2.32.2 (f

In [None]:
# Download a YouTube video and save it to a specified path. Two implementations are provided, using different libraries:
# if one stops working after a library update, try the other one.

# from pytube import YouTube
# def download_youtube_video(url, download_path='videos'):
#     yt = YouTube(url)
#     video = yt.streams.filter(only_audio=True).first()
#     video_path = video.download(output_path=download_path)

#     return video_path


import os

import yt_dlp


def download_youtube_video(url, download_path='videos'):
    ydl_opts = {
        # Video up to 480p plus m4a audio; otherwise fall back to a single-file MP4
        'format': 'bestvideo[height<=480]+bestaudio[ext=m4a]/mp4',
        'outtmpl': f'{download_path}/%(title)s.%(ext)s',
        'merge_output_format': 'mp4',  # Ensure the merged output is in MP4 format
        'http_headers': {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0'
        }
    }

    # Ensure the download directory exists
    os.makedirs(download_path, exist_ok=True)

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info_dict = ydl.extract_info(url, download=True)
        # Ask yt-dlp for the file name it actually used. Titles are sanitized
        # (characters such as "/" are replaced), so rebuilding the path from
        # the raw title can point at a file that does not exist.
        base, _ = os.path.splitext(ydl.prepare_filename(info_dict))
        video_path = f"{base}.mp4"

    return video_path


In [None]:
# YouTube URL and download path

youtube_url = 'https://www.youtube.com/watch?v=TQQlZhbC5ps&t=249s'
download_path = '/content/videos'
video_path = download_youtube_video(youtube_url, download_path)
# download_youtube_video(youtube_url, download_path)

[youtube] Extracting URL: https://www.youtube.com/watch?v=TQQlZhbC5ps&t=249s
[youtube] TQQlZhbC5ps: Downloading webpage
[youtube] TQQlZhbC5ps: Downloading ios player API JSON
[youtube] TQQlZhbC5ps: Downloading player 8eff86d5
[youtube] TQQlZhbC5ps: Downloading m3u8 information
[info] TQQlZhbC5ps: Downloading 1 format(s): 244+140
[download] Destination: /content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).f244.webm
[download] 100% of   12.26MiB in 00:00:00 at 17.11MiB/s  
[download] Destination: /content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).f140.m4a
[download] 100% of   12.11MiB in 00:00:00 at 30.43MiB/s  
[Merger] Merging formats into "/content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp4"
Deleting original file /content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).f244.webm (pass -k to keep)
Deleting original file /content/videos/Transformer Neural N

In [None]:
print("/content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp4")
print(video_path)

/content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp4
/content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp4
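
The introduction notes that a local video can be used instead of a YouTube link. A minimal sketch of that branch follows. `is_remote_url` and `resolve_video_path` are hypothetical helper names that do not appear in the notebook, and the `downloader` argument is assumed to be a function with the same signature as `download_youtube_video`.

```python
import os
from urllib.parse import urlparse


def is_remote_url(source):
    """Return True when `source` looks like an http(s) URL rather than a file path."""
    parsed = urlparse(source)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def resolve_video_path(source, downloader, download_path="videos"):
    """Download `source` if it is a URL; otherwise treat it as a local video file."""
    if is_remote_url(source):
        # Hypothetical: `downloader` is assumed to behave like download_youtube_video
        return downloader(source, download_path)
    if not os.path.isfile(source):
        raise FileNotFoundError(f"Local video not found: {source}")
    return source
```

With this in place, `video_path = resolve_video_path(youtube_url, download_youtube_video, download_path)` would cover both cases, and the remaining steps stay unchanged.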


## **Step 2: Extract Audio from Video**

This step uses `ffmpeg` to extract the audio from the downloaded video file and save it as an MP3 file.
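
For reference, roughly the same extraction can be expressed as a plain `ffmpeg` command line. The sketch below only builds the argument list. `build_ffmpeg_args` is a hypothetical helper, and actually running the command assumes the `ffmpeg` binary is available on the `PATH`.

```python
def build_ffmpeg_args(video_path, audio_path):
    """Build an ffmpeg command that writes the audio track of `video_path` to `audio_path`."""
    return [
        "ffmpeg",
        "-y",            # overwrite the output file if it already exists
        "-i", video_path,
        "-vn",           # drop the video stream, keep audio only
        audio_path,      # the .mp3 extension selects the MP3 output format
    ]


args = build_ffmpeg_args("talk.mp4", "talk.mp3")
print(" ".join(args))
# To execute it: subprocess.run(args, check=True)
```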


In [None]:
# Install ffmpeg-python library for audio extraction

!pip install --upgrade ffmpeg-python
import ffmpeg

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [None]:
# Function to extract audio from the downloaded video file

def extract_audio_from_video(video_path, audio_path):
    print(video_path, audio_path)
    if not os.path.isfile(video_path):
        print(f"Error: Video file does not exist: {video_path}")
        return

    try:
        (
            ffmpeg
            .input(video_path)
            .output(audio_path, format='mp3')
            .run(overwrite_output=True, capture_stdout=True, capture_stderr=True)
        )
        print(f"Audio extracted successfully to: {audio_path}")
    except ffmpeg.Error as e:
        print(f"FFmpeg error: {e.stderr.decode('utf8')}")
    except Exception as e:
        print(f"An unexpected error occurred: {str(e)}")

In [None]:
# Define the path for the extracted audio file

audio_path = video_path.replace(".mp4", ".mp3")
extract_audio_from_video(video_path, audio_path)

/content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp4 /content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp3
Audio extracted successfully to: /content/videos/Transformer Neural Networks - EXPLAINED! (Attention is all you need).mp3
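
Deriving the audio path with `str.replace` works here, but it rewrites every `.mp4` occurrence in the path and does nothing for other extensions. A slightly more defensive sketch, using a hypothetical helper named `audio_path_for`, swaps only the final extension:

```python
import os


def audio_path_for(video_path, audio_ext=".mp3"):
    """Swap the file extension of `video_path` for `audio_ext`."""
    stem, _ = os.path.splitext(video_path)
    return stem + audio_ext
```

The notebook cell could then use `audio_path = audio_path_for(video_path)`.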


## **Step 3: Transcribe Audio to Text**

This step loads OpenAI's Whisper model (`openai/whisper-large-v3`) through Hugging Face's `transformers` and wraps it in an automatic-speech-recognition pipeline. The transcription itself runs in Step 4.


In [None]:
# Install necessary libraries for speech recognition

!pip install --upgrade pip
!pip install --upgrade git+https://github.com/huggingface/transformers.git accelerate


Collecting pip
  Downloading pip-24.1.2-py3-none-any.whl (1.8 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.1.2
Collecting git+https://github.com/huggingface/transformers.git
  Cloning https://github.com/huggingface/transformers.git to /tmp/pip-req-build-x8vsj2_r
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers.git /tmp/pip-req-build-x8vsj2_r
  Resolved https://github.com/huggingface/transformers.git to commit b31d59504003c8140adf66a4077b1c50799fbe89
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting accelerate
  Downloadi

### Download model and create pipeline

In [None]:
# Import required modules from transformers

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Determine the device to use for processing (GPU if available, otherwise CPU)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Define the model ID for Whisper
model_id = "openai/whisper-large-v3"

# Load the Whisper model and processor
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)


# Create a pipeline for automatic speech recognition
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=350,
    torch_dtype=torch_dtype,
    device=device,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## **Step 4: Get Transcription of Audio**

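This step splits the extracted audio into 120-second chunks with `pydub` and transcribes each chunk with the Whisper pipeline created in Step 3. Processing the audio chunk by chunk keeps each pipeline call small and makes longer files easier to handle. The chunk texts are then joined into a single transcription.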
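
The chunk boundaries can be computed independently of any audio library. The sketch below is illustrative only: `chunk_bounds` is a hypothetical helper, and the 120-second default mirrors the chunk size used in this notebook.

```python
def chunk_bounds(total_ms, chunk_ms=120_000):
    """Return (start_ms, end_ms) pairs that cover [0, total_ms) without empty chunks."""
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


print(chunk_bounds(250_000))
# [(0, 120000), (120000, 240000), (240000, 250000)]
```

Clamping the end of the last chunk to the total length, and stepping over milliseconds directly, avoids producing an empty trailing chunk when the duration is an exact multiple of the chunk size.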


In [None]:
# Install pydub for audio manipulation

!pip install pydub
from pydub import AudioSegment
import numpy as np

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1

In [None]:
# Function to extract a specific chunk of audio from the file

def get_audio_chunk(audio_path, start_ms, end_ms):
    audio = AudioSegment.from_file(audio_path)
    chunk = audio[start_ms:end_ms]
    return chunk

In [None]:
# Function to transcribe an audio chunk using the Whisper pipeline

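def transcribe_audio_chunk(pipe, audio_chunk, sample_rate=16000):
    # Convert to 16 kHz, mono, 16-bit PCM
    audio_chunk = audio_chunk.set_frame_rate(sample_rate).set_channels(1).set_sample_width(2)
    # Scale the int16 samples to floats in [-1.0, 1.0) before passing them to the pipeline
    audio_data = np.array(audio_chunk.get_array_of_samples(), dtype=np.float32) / 32768.0
    return pipe(audio_data, generate_kwargs={"language": "english"})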
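
Speech models such as Whisper are normally fed floating-point waveforms in the range [-1.0, 1.0], while `get_array_of_samples()` returns integer PCM values. The sketch below shows the scaling step in isolation, assuming 16-bit samples. `pcm16_to_float32` is a hypothetical helper name.

```python
import numpy as np


def pcm16_to_float32(samples):
    """Scale signed 16-bit PCM samples to float32 values in [-1.0, 1.0)."""
    return np.asarray(samples, dtype=np.float32) / 32768.0


waveform = pcm16_to_float32([0, 16384, -32768, 32767])
print(waveform.dtype, float(waveform.min()), float(waveform.max()))
```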

In [None]:
# Process the entire audio file in chunks to get the full transcription

chunk_ms = 120 * 1000  # 120-second chunks
result = []
audio = AudioSegment.from_file(audio_path)  # decode the file once
for start_ms in range(0, len(audio), chunk_ms):  # len(audio) is in milliseconds
    # Slice the already-loaded audio instead of re-reading the file for every chunk
    audio_chunk = audio[start_ms:start_ms + chunk_ms]
    result.append(transcribe_audio_chunk(pipe, audio_chunk)["text"])

# Combine all chunks into a single transcription
transcription = ''.join(result)

You have passed language=english, but also have set `forced_decoder_ids` to [[1, None], [2, 50360]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=english.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


## **Step 5: Summarize the Transcription**

This step uses Google's Gemini model to perform several analyses on the transcription. The functions `abstract_summary_extraction`, `key_points_extraction`, `action_item_extraction`, `sentiment_analysis`, and `detailed_summery` generate an abstract, key points, action items, a sentiment analysis, and a detailed section-by-section summary, respectively.
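
Most of these analysis functions follow the same pattern: an instruction, the transcription, and an answer label are joined into one prompt and sent to the model. A minimal sketch of that shared pattern is shown below. `build_prompt` and `run_analysis` are hypothetical names, and `model` is assumed to be any object that exposes `generate_content(prompt)` and returns a response with a `.text` attribute, as the Gemini client does in this notebook.

```python
def build_prompt(instruction, transcription, label):
    """Join an instruction, the transcription, and an answer label into one prompt."""
    return f"{instruction}\n\nText:\n{transcription}\n\n{label}:"


def run_analysis(model, instruction, transcription, label):
    """Send the prompt to `model.generate_content` and return the response text."""
    response = model.generate_content(build_prompt(instruction, transcription, label))
    return response.text
```

Keeping the prompt construction in one place makes it easier to add a new analysis type without copying the request code.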


### Gemini model config

In [None]:
# Import Google Generative AI library
import google.generativeai as genai
import os
from google.colab import userdata

# Configure the API key for the Gemini model
GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)
model = genai.GenerativeModel('gemini-1.5-flash')
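
`userdata.get` is only available inside Google Colab. When running elsewhere, one option is to fall back to an environment variable. The sketch below assumes the key is stored under the same name in both places. `load_api_key` is a hypothetical helper, not part of the Gemini SDK.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
        key = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing: use the environment instead
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; add it as a Colab secret or environment variable.")
    return key
```

The result can be passed to `genai.configure(api_key=load_api_key())`.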

### Define functions for different types of text analysis

In [None]:
def abstract_summary_extraction(transcription):
    """Generate a concise abstract of the transcription."""
    prompt = (
        "You are a highly skilled AI trained in language comprehension and summarization. "
        "I would like you to read the following text and summarize it into a concise abstract paragraph. "
        "Aim to retain the most important points, providing a coherent and readable summary that could help a person "
        "understand the main points of the discussion without needing to read the entire text. "
        "Please avoid unnecessary details or tangential points.\n\n"
        f"Text:\n{transcription}\n\nSummary:"
    )

    response = model.generate_content(prompt)
    return response.text

In [None]:
def key_points_extraction(transcription):
    prompt = (
        "You are a proficient AI with a specialty in distilling information into key points. "
        "Based on the following text, identify and list the main points that were discussed or brought up. "
        "These should be the most important ideas, findings, or topics that are crucial to the essence of the discussion. "
        "Your goal is to provide a list that someone could read to quickly understand what was talked about.\n\n"
        f"Text:\n{transcription}\n\nKey Points:"
    )

    response = model.generate_content(prompt)
    return response.text

In [None]:
def action_item_extraction(transcription):
    prompt = (
        "You are an AI expert in analyzing conversations and extracting action items. "
        "Please review the text and identify any tasks, assignments, or actions that were agreed upon or mentioned as needing to be done. "
        "These could be tasks assigned to specific individuals, or general actions that the group has decided to take. "
        "Please list these action items clearly and concisely.\n\n"
        f"Text:\n{transcription}\n\nAction Items:"
    )

    response = model.generate_content(prompt)
    return response.text

In [None]:
def sentiment_analysis(transcription):
    prompt = (
        "As an AI with expertise in language and emotion analysis, your task is to analyze the sentiment of the following text. "
        "Please consider the overall tone of the discussion, the emotion conveyed by the language used, and the context in which words and phrases are used. "
        "Indicate whether the sentiment is generally positive, negative, or neutral, and provide brief explanations for your analysis where possible.\n\n"
        f"Text:\n{transcription}\n\nSentiment Analysis:"
    )

    response = model.generate_content(prompt)
    return response.text

In [None]:
def detailed_summery(transcription):
    prompt = (f'''Summarize the following video transcription in detail. Ensure that you cover the following aspects:
            1. Provide an introduction that includes the title, main topic, and purpose of the video.
            2. Divide the transcription into sections based on topic changes, speakers, or segments, and summarize each section.
            3. Highlight all key points, arguments, data, statistics, examples, and anecdotes.
            4. Extract important quotes, definitions, step-by-step processes, and instructions.
            5. Note significant visual or audio elements such as slides, graphics, demonstrations, and changes in tone or emotion.
            6. List any action items, recommendations, or next steps given in the video.
            7. Conclude with the speaker’s closing remarks, calls to action, and information on additional resources or contacts.

            Transcription:
            [{transcription}]''')
    response = model.generate_content(prompt)
    return response.text

In [None]:
# Generate different types of analysis from the transcription

abstract = abstract_summary_extraction(transcription)
key_points = key_points_extraction(transcription)
action_items = action_item_extraction(transcription)
sentiment = sentiment_analysis(transcription)
# Store the result under a new name so the detailed_summery function is not overwritten
detailed_summary = detailed_summery(transcription)

In [None]:
print(abstract)

In [None]:
print(key_points)

In [None]:
print(action_items)

In [None]:
print(sentiment)

In [None]:
print(detailed_summary)

## Understanding Transformer Neural Networks: A Detailed Video Summary

**Introduction:** 

This video, titled "[Recurrent neural nets. They are feed-forward neural networks rolled out over time...]"  dives into the world of Transformer neural networks. The speaker aims to explain how these networks work and why they've become so popular, especially in replacing traditional recurrent neural networks (RNNs) for tasks involving sequence data. 

**Section 1: Introduction to Recurrent Neural Networks (RNNs)**

* **Types of RNNs:** The speaker first introduces three main types of RNNs:
    * **Vector-to-sequence:** Input is a fixed-size vector, output is a sequence (e.g., image captioning).
    * **Sequence-to-vector:** Input is a sequence, output is a fixed-size vector (e.g., sentiment analysis).
    * **Sequence-to-sequence:** Input is a sequence, output is another sequence (e.g., language translation).
* **Limitations of RNNs:** The speaker highlights two major drawbacks of RNNs:
    * *

In [None]:
# Video Summarization Project

This project demonstrates the process of extracting audio from a YouTube video, transcribing the audio to text, and summarizing the text with Google's Gemini model. It uses `yt-dlp` (or `pytube`) for downloading videos, `ffmpeg` for audio extraction, Hugging Face's `transformers` for speech recognition, and Google's Gemini model for text summarization and analysis.

## Project Description

The project is divided into several steps:
1. **Download YouTube Video**: Use `yt-dlp` (or `pytube`) to download the YouTube video.
2. **Extract Audio from Video**: Use `ffmpeg` to extract the audio from the downloaded video file.
3. **Transcribe Audio to Text**: Use the Whisper model, loaded through Hugging Face's `transformers`, to convert the audio into text.
4. **Summarize the Transcription**: Use Google's Gemini model to generate an abstract, key points, action items, a sentiment analysis, and a detailed summary from the transcription.
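
The four steps chain together as a simple pipeline. The sketch below wires them up with injected callables so that it runs without any of the heavy dependencies. `summarize_video` and its parameter names are hypothetical, and in the notebook the callables would be the functions defined in the steps above.

```python
import os


def summarize_video(url, download, extract_audio, transcribe, summarize):
    """Run the four project steps in order, passing each result to the next step."""
    video_path = download(url)                           # step 1: download
    audio_path = os.path.splitext(video_path)[0] + ".mp3"
    extract_audio(video_path, audio_path)                # step 2: extract audio
    transcription = transcribe(audio_path)               # step 3: transcribe
    return summarize(transcription)                      # step 4: summarize
```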

## Usage Instructions

To run this notebook, follow these steps:

1. **Clone the Repository**:
    ```bash
    git clone https://github.com/yourusername/your-repo-name.git
    cd your-repo-name
    ```

2. **Install Dependencies**:
    Ensure you have Python, pip, and the `ffmpeg` binary installed, then install the required packages:
    ```bash
    pip install yt-dlp ffmpeg-python pydub torch transformers accelerate google-generativeai
    ```

3. **Run the Notebook**:
    Open the notebook in Jupyter or Google Colab and execute the cells step by step.

4. **API Key Configuration**:
    Ensure you have your Google API key set up in the environment. In Google Colab, you can set it like this:
    ```python
    from google.colab import userdata
    GOOGLE_API_KEY = userdata.get("GOOGLE_API_KEY")
    ```

## Example Usage

The calls below show how each step is invoked:

1. **Downloading YouTube Video**:
    ```python
    youtube_url = 'https://www.youtube.com/watch?v=ySus5ZS0b94&t=195s'
    download_path = 'videos'
    video_path = download_youtube_video(youtube_url, download_path)
    ```

2. **Extracting Audio from Video**:
    ```python
    audio_path = video_path.replace(".mp4", ".mp3")
    extract_audio_from_video(video_path, audio_path)
    ```

3. **Transcribing Audio to Text**:
    ```python
    transcription = ''.join(result)
    print(transcription)
    ```

4. **Summarizing the Transcription**:
    ```python
    abstract = abstract_summary_extraction(transcription)
    key_points = key_points_extraction(transcription)
    action_items = action_item_extraction(transcription)
    sentiment = sentiment_analysis(transcription)
    ```

## Dependencies

The following libraries are used in this project:
- `yt-dlp`: the run above installed `2024.7.16` (`pytube` `>=12.1.0` is the commented-out alternative)
- `ffmpeg-python`: `>=0.2.0`
- `pydub`: `>=0.24.1`
- `transformers`: `>=4.10.0`
- `accelerate`: `>=0.5.1`
- `google-generativeai`: For using Google's Gemini model.

