<a href="https://www.kaggle.com/code/debajyotidas/generatingyoutubevideosummariesusingllms?scriptVersionId=234268918" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

YouTube has educational content on pretty much any topic, from academic subjects like math and programming to hands-on projects, tutorials, and preparation for professional certifications.

But as a learning tool, YouTube isn't perfect. Some videos have too many ads and sponsorship interruptions, some are slowed down by non-essential information, while others require viewers to pause frequently just to follow the steps.

Imagine if we could get a concise video summary, review it to determine whether it's worth watching, extract step-by-step guidance so we could easily follow along, and at the end, generate a quiz to test our understanding. Wouldn't that be awesome?

In this tutorial, we will be doing exactly that!
We will use open-source ML models from Hugging Face, and OpenAI's ChatGPT APIs.
We should also be able to apply these steps to other use cases by selecting different ML models or adjusting ChatGPT prompts.

## Let us first use pip to install all the packages required to complete this tutorial.

In [2]:
#installing libraries
!pip install youtube_transcript_api
!pip install transformers
!pip install openai



## Next, let's import all the necessary dependencies.

In [1]:
#importing dependencies
import re
import os
import openai
import textwrap
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline, AutoTokenizer
from openai import OpenAI

## Now we're ready to work on our first task, which is to obtain a transcript of a YouTube video.
#### One can choose any YouTube video and replace the link in the youtube_url variable. To get a YouTube video URL, copy the address up to, but not including, the first "&" sign.

#### Note: It is recommended that we use a video that is under 30 minutes. This will allow us to complete the tutorial more quickly, as executing commands for longer videos will take more time.

The code below checks that the URL contains a video ID and then calls the fetch(video_id) method on a YouTubeTranscriptApi instance to retrieve the transcript. The transcript comes from the captions attached to the video, which may be auto-generated rather than written by the uploader.

In [None]:
# Specify the YouTube video URL
youtube_url = "https://www.youtube.com/watch?v=b9rs8yzpGYk"

# Extract the video ID from the URL using regular expressions
match = re.search(r"v=([A-Za-z0-9_-]+)", youtube_url)
if match:
    video_id = match.group(1)
else:
    raise ValueError("Invalid YouTube URL")

# Get the transcript from YouTube
try:
    youtubeapi = YouTubeTranscriptApi()
    transcript =  youtubeapi.fetch(video_id)
except Exception as e:
    print(f"Error retrieving transcript: {e}")
    transcript = []  # Set to empty list to avoid further errors

# Concatenate the transcript into a single string.
# fetch() returns snippet objects, so the text is read from the .text attribute.
transcript_text = ""
for segment in transcript:
    transcript_text += segment.text + " "

[('en', 'English (auto-generated)')]
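The regular expression above only looks for a `v=` query parameter, so short links would be rejected. As a minimal sketch, a more tolerant extractor could look like the following. The function name `extract_video_id` and the set of URL shapes it accepts are assumptions made for illustration, not part of the original notebook, and the list is not exhaustive.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical helper: hosts treated as "regular" YouTube pages in this sketch.
YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com")

def extract_video_id(url):
    """Return the video ID from a watch, short-link, or embed URL, or raise ValueError."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    candidate = ""
    if host == "youtu.be":
        # Short links carry the ID as the first path component.
        candidate = parsed.path.lstrip("/").split("/")[0]
    elif host in YOUTUBE_HOSTS:
        if parsed.path == "/watch":
            # Standard links carry the ID in the "v" query parameter.
            candidate = parse_qs(parsed.query).get("v", [""])[0]
        elif parsed.path.startswith(("/embed/", "/shorts/")):
            candidate = parsed.path.split("/")[2]
    if not candidate:
        raise ValueError(f"Could not find a video ID in: {url}")
    return candidate

video_id = extract_video_id("https://www.youtube.com/watch?v=b9rs8yzpGYk&t=42s")
print(video_id)
```

Because the query string is parsed rather than pattern-matched, extra parameters after the "&" sign no longer need to be trimmed by hand.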


In [5]:
print(transcript_text)

[Music] good morning everybody and welcome back to exploring aws now this morning what i want to talk about is regions and availability zones so there's a lot more to amazon's infrastructure than just those two things but those are kind of the two important things that you're really going to want to know when you're thinking about deploying things to the cloud so let me just grab my pen here really really quick and we'll just kind of draw a quick quick whiteboard session here before we take a look at it so amazon has multiple regions across the globe now these regions are not necessarily a specific data center what they've done is they've said okay we're going to pick a geographical location where we can have a cluster of data centers so let's just say that this region is northern virginia okay in northern virginia there may be multiple data centers that are spread across the northern virginia geographical area and each one of these data centers would be a separate availability zone so

# Summarizing and Translating a Transcript Using ML Models

#### Now that we have the full transcript of the YouTube video, we can proceed to utilize open-source models for natural language processing tasks, such as summarization, translation, and more. These models will help us to extract valuable insights from the transcript.

#### We will be using the Transformers library from Hugging Face 🤗. By using pretrained models, we can significantly reduce our compute costs and carbon footprint - and we can save valuable time and resources that would otherwise be required to train a model from scratch.

#### Let's assume that English is not our first language, and we would like to translate the YouTube transcript to Spanish. To achieve this, we can utilize a pretrained machine learning model specifically designed for translation. Translation involves converting a sequence of text from one language to another. It is a task that can be formulated as a sequence-to-sequence problem. By leveraging a pretrained sequence-to-sequence translation model, we can effectively translate the YouTube transcript from English to Spanish.

In [6]:
# Maximum number of characters per segment passed to the translator
max_length = 512

# Replace this with your own checkpoint
model_checkpoint = "Helsinki-NLP/opus-mt-en-es"
translator = pipeline("translation", model=model_checkpoint)
# model_checkpoint = "google-t5/t5-small"
# translator = pipeline("translation_es_to_en", model=model_checkpoint,max_length=max_length)


# Split the input text into smaller segments
segments = [transcript_text[i:i+max_length] for i in range(0, len(transcript_text), max_length)]

# Translate each segment and concatenate the results
translated_text = ""
for segment in segments:
    result = translator(segment)
    translated_text += result[0]['translation_text']

Device set to use cuda:0
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
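The cell above slices the transcript every 512 characters, which can cut a word in half, and the translated pieces are then concatenated without a separator. The following is a minimal sketch of an alternative that splits only at whitespace, using `textwrap` from the standard library. The helper name `split_on_word_boundaries` is hypothetical and the sample text is taken from the transcript shown earlier.

```python
import textwrap

def split_on_word_boundaries(text, max_chars=512):
    """Split text into segments of at most max_chars characters, breaking only at whitespace."""
    return textwrap.wrap(
        text,
        width=max_chars,
        break_long_words=False,   # never cut inside a word
        break_on_hyphens=False,   # keep hyphenated words together
    )

sample = "good morning everybody and welcome back to exploring aws " * 20
segments = split_on_word_boundaries(sample.strip(), max_chars=120)

# Every segment ends on a complete word and stays within the limit.
print(len(segments), max(len(s) for s in segments))
```

When the translated segments are reassembled, joining them with `" ".join(...)` instead of `+=` keeps a space between the last word of one segment and the first word of the next.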


In [7]:
print(translated_text)

[Música] Buenos días a todos y bienvenidos de nuevo a explorar aws ahora esta mañana de lo que quiero hablar es de regiones y zonas de disponibilidad así que hay mucho más para la infraestructura de Amazon que sólo esas dos cosas, pero esas son una especie de las dos cosas importantes que realmente vas a querer saber cuando estás pensando en desplegar cosas en la nube, así que permítanme tomar mi pluma aquí realmente muy rápido y vamos a dibujar una rápida sesión de pizarra blanca aquí antes de echar un vistazo así queamazon tiene múltiples regiones en todo el mundo ahora estas regiones no son necesariamente un centro de datos específico lo que han hecho es que han dicho bien vamos a elegir una ubicación geográfica donde podemos tener un grupo de centros de datos así que vamos a decir que esta región es la virginia del norte bien en la virginia del norte puede haber múltiples centros de datos que se extienden por el área geográfica de la virginia del norte y cad

#### Next, we will summarize the video using a pretrained text summarization model. Here we use the original English transcript. However, if one chooses to continue with the translated transcript, one can replace the 'transcript_text' variable with the 'translated_text' variable. Applying the summarization model to the transcript gives us a concise summary of the video's content.

In [8]:
# Instantiate the tokenizer and the summarization pipeline
tokenizer = AutoTokenizer.from_pretrained('stevhliu/my_awesome_billsum_model')
summarizer_stevhliu = pipeline("summarization", model='stevhliu/my_awesome_billsum_model', tokenizer=tokenizer)

Device set to use cuda:0


In [9]:
# Instantiate the tokenizer and the summarization pipeline
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')
summarizer_facebook = pipeline("summarization", model='facebook/bart-large-cnn', tokenizer=tokenizer)

Device set to use cuda:0


In [10]:
# Define chunk size in number of words
chunk_size = 200 # one may need to adjust this value depending on the average length of your words

# Split the text into chunks
words = transcript_text.split()
chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

#### In the next two cells, we use two different pretrained models for summarization:


1.  stevhliu/my_awesome_billsum_model
2.  facebook/bart-large-cnn

We have already created the tokenizer and summarization pipeline for both of these models above.



In [11]:
# Summarize each chunk using stevhliu/my_awesome_billsum_model
summaries = []
for chunk in chunks:
    # Summarize the chunk
    summary = summarizer_stevhliu(chunk, max_length=100, min_length=30, do_sample=False)

    # Extract the summary text
    summary_text = summary[0]['summary_text']

    # Add the summary to our list of summaries
    summaries.append(summary_text)

# Join the summaries back together into a single summary
final_summary_stevhliu = ' '.join(summaries)

In [12]:
# Summarize each chunk using facebook/bart-large-cnn
summaries = []
for chunk in chunks:
    # Summarize the chunk
    summary = summarizer_facebook(chunk, max_length=100, min_length=30, do_sample=False)

    # Extract the summary text
    summary_text = summary[0]['summary_text']

    # Add the summary to our list of summaries
    summaries.append(summary_text)

# Join the summaries back together into a single summary
final_summary_facebook = ' '.join(summaries)

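#### We can also perform summarization on the translated text. Note that facebook/bart-large-cnn was fine-tuned on English news articles, so its output for Spanish input may mix in English words.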

In [13]:
# Performing summarization using facebook/bart-large-cnn, on the translated spanish text
# Split the text into chunks
words = translated_text.split()
chunks = [' '.join(words[i:i+chunk_size]) for i in range(0, len(words), chunk_size)]

# Summarize each chunk
summaries = []
for chunk in chunks:
    # Summarize the chunk
    summary = summarizer_facebook(chunk, max_length=100, min_length=30, do_sample=False)

    # Extract the summary text
    summary_text = summary[0]['summary_text']

    # Add the summary to our list of summaries
    summaries.append(summary_text)

# Join the summaries back together into a single summary
final_summary_facebook_es = ' '.join(summaries)
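The three summarization cells above repeat the same split, summarize, and join loop. As a sketch of how that loop could be factored out, the hypothetical helper `summarize_in_chunks` below accepts any callable that behaves like a Hugging Face summarization pipeline, meaning it returns a list whose first item has a `summary_text` key. The `fake_summarizer` stub is an assumption used only so the example runs without downloading a model.

```python
def summarize_in_chunks(text, summarizer, chunk_size=200, **generate_kwargs):
    """Split text into chunks of chunk_size words, summarize each, and join the results."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
    summaries = [summarizer(chunk, **generate_kwargs)[0]["summary_text"] for chunk in chunks]
    return " ".join(summaries)

# Stand-in for a real pipeline: "summarizes" a chunk by returning its first word.
def fake_summarizer(chunk, **kwargs):
    return [{"summary_text": chunk.split()[0]}]

print(summarize_in_chunks("a b c d e", fake_summarizer, chunk_size=2))

# With the pipelines defined earlier in the notebook, the call would look like:
# final_summary_facebook = summarize_in_chunks(
#     transcript_text, summarizer_facebook,
#     max_length=100, min_length=30, do_sample=False)
```

Keeping the chunking and the model call in one place makes it easier to try a different `chunk_size` or a different model without editing three cells.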

In [14]:
print(final_summary_stevhliu)

amazon has multiple regions across the globe so let's just say that this region is northern virginia okay there may be multiple data centers that are spread across the northern virginians geographical area and each one of these data centers would be a separate availability zone so when you see availability zone or az that's equivalent to a data center living within that region within that area so amazon may be able to have a cluster of data centers. if you head over to aws.amazon.com you can see the url here and you scroll down about three quarters of the way you're gonna see a map now you can click load more that'll bring you to another page you will have the same map on it with just a little bit more of a breakdown of different different things right at different different categories and what the different options are but that's not what i web server in another data center i have an efs share in each data center so this is how you can figure out where and how you want to spread your 

In [15]:
print(final_summary_facebook)

There's a lot more to amazon's infrastructure than just regions and availability zones. There may be multiple data centers that are spread across the northern virginia geographical area. Each one of these data centers would be a separate availability zone. If you head over to aws.amazon.com you can see the url here and you scroll down about three quarters of the way you're gonna see a map now you can click load more that'll bring you to another page. There are six availability zones or six data centers within northern virginia within that region that i could use. You can use this map to figure out where and what region you're going to be able to put your data in and also how many data centers do you have as an option. For example in northern virginia i have six availability zones but over in ohio i only have three so maybe i want to pick northern virginian u.s east as my availability zone. Amazon is in the process of building out data centers within that region now if i come over here 

In [16]:
print(final_summary_facebook_es)

Amazon tiene múltiples regiones en todo el mundo ahora. Vamos a dibujar una rápida sesión de pizarra blanca aquí antes de echar un vistazo. Geográficamente hay seis zonas de disponibilidad o seis centros de datos dentro de Virginia del Norte dentro of esa región. ahora puedo distribuir mis cargas de trabajo a través of los seis centros of datos. The mapa. of esta demo se extiende a través of dos zonas de disponibilidad diferentes o dos centros de datos diferencees. Puedes averiguar dónde y cómo quieres difundir tus datos. Aquí es donde realmente vas a empezar a tomar algunas decisiones sobre dónde están sus datos. estas pequeñas burbujas naranjas aquí van a ser centros of datos o regiones that vienen pronto. Amazon está en el proceso de construir centros de datos a um. voy a hacer clic en Amazon ec2 para Una región es una ubicación geográfica y usted tendrá una Región aquí así dentro of la mismaRegión usted puede tener u.s. this uno u.S uh usted sabe that esto 

#### Below, we translate the Spanish summary back to English.

In [17]:
# Translating the Spanish summary to English
model_checkpoint = "Helsinki-NLP/opus-mt-es-en"
translator = pipeline("translation", model=model_checkpoint)


# Split the input text into smaller segments
segments = [final_summary_facebook_es[i:i+max_length] for i in range(0, len(final_summary_facebook_es), max_length)]

# Translate each segment and concatenate the results
translated_text = ""
for segment in segments:
    result = translator(segment)
    translated_text += result[0]['translation_text']

print(translated_text)

Device set to use cuda:0


Amazon has multiple regions around the world now. Let's draw a quick whiteboard session here before taking a look. Geographically there are six areas of availability or six data centers within North Virginia within that region. Now I can distribute my workloads through the six data centers. The map. of this demo extends through two different availability zones or two different data centers. You can find out where and how you want diffu.ndir tus datos. This is where you're really going to start making some decisions about where your data is. These little orange bubbles here are going to be data centers or regions that are coming soon. Amazon is in the process of building data centers to um. I'm going to click Amazon ec2 for A region is a geographic location and you will have a region here as well within the sameRegion you can have u.s. this one u.S uh you know that this can be eu west one so I can have diregions and then Amazon diff you have broken your infrastructure and again I enco

## We were able to get a concise summary of the video's content, excluding any sponsorships, advertisements, or other extraneous information. This enables us to quickly grasp the key points and main ideas from the video without being slowed down by unnecessary details.

## Let us now proceed to the next step, where we will re-generate a summary to compare results from OpenAI vs an open-source model, as well as create a step-by-step tutorial based on the summarized transcript and a quiz to test our understanding and gained knowledge.

# Extracting Steps and Creating a Quiz Using ChatGPT APIs

#### Let's obtain a video summary using the ChatGPT model and compare it to the summary we obtained in the previous step using open-source models.

In [1]:
#OPENAIKEY = User's OpenAI API Key. Get an API Key here: https://platform.openai.com/settings/organization/api-keys

In [20]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()
OPENAIKEY = user_secrets.get_secret("OPENAIKEY")
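The `kaggle_secrets` module is only available inside Kaggle, so the cell above fails when the notebook is run elsewhere. A minimal sketch of a fallback is shown below. The helper name `load_openai_key` and the choice of the `OPENAI_API_KEY` environment variable are assumptions for illustration, while the secret name `OPENAIKEY` comes from the cell above.

```python
import os

def load_openai_key(secret_name="OPENAIKEY", env_var="OPENAI_API_KEY"):
    """Read the API key from Kaggle Secrets if possible, otherwise from an environment variable."""
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Outside Kaggle (or if the secret is missing), fall back to the environment.
        key = os.environ.get(env_var)
        if not key:
            raise RuntimeError(
                f"No API key found: add a Kaggle secret named {secret_name} "
                f"or set the {env_var} environment variable."
            )
        return key

# Example usage:
# OPENAIKEY = load_openai_key()
```

Either way, the key stays out of the notebook source, which matters if the notebook is shared publicly.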

In [21]:
#Defining a function to generate chunks of text from the input transcript
def split_text_into_chunks(text, max_chunk_size):
    return textwrap.wrap(text, max_chunk_size)

client = OpenAI(api_key=OPENAIKEY)  # OpenAI API key fetched from Kaggle Secrets
max_chunk_size = 4000  # maximum number of characters per chunk

transcript_chunks = split_text_into_chunks(transcript_text, max_chunk_size)
summaries = ""

for chunk in transcript_chunks:
    response = client.chat.completions.create(
                                              model="gpt-4o-mini",
                                              messages=[
                                                        {"role": "system", "content": "You are a helpful assistant."},
                                                        {"role": "user", "content": f"{chunk}\n\nCreate short concise summary"}
                                                      ],
                                              max_tokens=250,
                                              temperature=0.5
                                            )

    summaries += response.choices[0].message.content.strip() + " "

print("Summary: \n")
print(summaries)

Summary: 

In this session on AWS infrastructure, the focus is on understanding regions and availability zones (AZs). AWS has multiple global regions, each containing several data centers, with each data center representing an availability zone. For example, the Northern Virginia region has six availability zones, allowing users to distribute workloads for redundancy and performance. Users can check the AWS website for a map of regions and availability zones to decide where to deploy their data based on geographical proximity and the number of available zones. The discussion also highlights the importance of documentation for further guidance on using AWS services, specifically referencing Amazon EC2 for compute resources. In this video, the speaker emphasizes the importance of understanding AWS regions and availability zones for cloud practitioners. A region is a geographical area, such as "US East 1" or "EU West 1," which contains multiple availability zones. The speaker encourages v

## Let's proceed by modifying the prompts and instructing ChatGPT to extract the necessary steps from the video transcript.
#### By doing so, we can generate a step-by-step guide that provides clear instructions for us to follow along. This will help us to have a structured, guided approach while engaging with the video content.

In [22]:
response = client.chat.completions.create(
                                          model="gpt-4o-mini",
                                          messages=[
                                                    {"role": "system", "content": "You are a technical instructor."},
                                                    {"role": "user", "content": transcript_text},
                                                    {"role": "user", "content": "Generate steps to follow from text."},
                                                  ]
                                        )

# The assistant's reply
guide= response.choices[0].message.content

print("Steps:")
print(guide)

Steps:
Certainly! Here are the steps to follow based on the provided text about AWS regions and availability zones:

1. **Understand Regions and Availability Zones**:
   - Learn the definitions: A region is a geographical location with multiple data centers, while availability zones (AZs) are individual data centers within those regions.

2. **Explore AWS Regions**:
   - Visit the AWS website (aws.amazon.com).
   - Navigate to the section showcasing AWS infrastructure.
   - Find the map illustrating the regions and availability zones.

3. **Identify Specific Regions**:
   - Decide on a region based on your needs (e.g., geographical proximity to users or number of availability zones).
   - Example: Choose Northern Virginia (us-east-1) as it has six availability zones compared to Ohio, which has three.

4. **Utilize Availability Zones for Load Distribution**:
   - Consider spreading your workloads across multiple availability zones for better redundancy and availability.
   - Example: De

## Let's generate a quiz based on the materials covered in the video.
#### The quiz will assess our understanding of the content. You will see a quiz with 10 questions generated to test your knowledge. This can be especially helpful if you are preparing for exams. You can modify the prompt to explain the right answers - for example: "Generate 10 quiz questions based on the text with multiple choices and explain why a particular answer is the right one".
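The prompt variation described above can be produced by a small helper that builds the `messages` list, so the number of questions and the request for explanations become parameters. This is a sketch only: the function name `build_quiz_messages` and its parameters are hypothetical, and the wording of the prompts is copied from this tutorial.

```python
def build_quiz_messages(transcript_text, num_questions=10, explain_answers=False):
    """Build the chat messages for a quiz request, optionally asking for explanations."""
    instruction = f"Generate {num_questions} quiz questions based on the text with multiple choices"
    if explain_answers:
        instruction += " and explain why a particular answer is the right one"
    instruction += "."
    return [
        {"role": "system", "content": "You are a helpful assistant that generates questions."},
        {"role": "user", "content": transcript_text},
        {"role": "user", "content": instruction},
    ]

messages = build_quiz_messages("...transcript goes here...", num_questions=5, explain_answers=True)
print(messages[-1]["content"])
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create(...)` in place of the inline list used in the next cell.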

In [23]:
response = client.chat.completions.create(
                                          model="gpt-4o-mini",
                                          messages=[
                                                    {"role": "system", "content": "You are a helpful assistant that generates questions."},
                                                    {"role": "user", "content": transcript_text},
                                                    {"role": "user", "content": "Generate 10 quiz questions based on the text with multiple choices."},
                                                  ]
                                        )

# The assistant's reply
quiz_questions = response.choices[0].message.content

print("Quiz Questions:")
print(quiz_questions)

Quiz Questions:
Sure! Here are 10 quiz questions based on the text, each with multiple-choice answers:

1. **What are the two key components of Amazon's infrastructure discussed in the text?**
   - A) Regions and Instances
   - B) Regions and Availability Zones
   - C) Data Centers and Availability Zones
   - D) EC2 Instances and Regions

   **Answer:** B) Regions and Availability Zones

2. **How is a region defined in the context of Amazon's infrastructure?**
   - A) A single data center
   - B) A geographical location containing multiple data centers
   - C) An account holder's location
   - D) A specific service offered by AWS

   **Answer:** B) A geographical location containing multiple data centers

3. **What does the acronym "AZ" stand for?**
   - A) Amazon Zone
   - B) Availability Zone
   - C) Active Zone
   - D) Azure Zone

   **Answer:** B) Availability Zone

4. **How many availability zones does the northern Virginia region have, according to the text?**
   - A) Three
   - 