# Youtube Video Transcript Summarisation

**Task:** Exctract and Summarise the trascript of a Youtube video

**Solution:** to summarise a video transcript, I must do the following steps: 

**Step 1: Get the video transcript**
Here I have two options: 
-Youtube Data API: if it is an official task from an official project, I'd better use Youtube Data API if human-made captions are available for the video. The pros of this is that the offical API is supported by Youtube, meaning that the risk of breaking because Youtube changed something, is minimal. It can also be useful for fetching meta-data. 
-youtube_transcript_api: a library in Python that can extract the captions even if they are auto-generated. I won't have to authenticate but it is possible that the code can break if Youtube changes something. 

Since this is not an official project, I don't need meta-data and I don't plan to use it intensively for future needs, I will use the youtube_transcript_api. I prefer to do it quickly and to not deal with authentication. 

**Step 2: Summarise the transcript with a suitable LLM from Hugging Face**
Then I need to pass the video transcript in a suitable LLM for summarisation but I may also have to do some preprocessing of the transcript. 

#### First model: BART and chuncking

In [1]:
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline

def get_youtube_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = " ".join([entry['text'] for entry in transcript])
        return text
    except Exception as e:
        return f"Error: {str(e)}"

def summarize_text(text, max_words=180):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
    chunk_size = 1024  
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    
    summary = []
    for chunk in chunks:
        summarized_chunk = summarizer(chunk, max_length=max_words, min_length=50, do_sample=False)
        summary.append(summarized_chunk[0]['summary_text'])

    return " ".join(summary)

# Example Usage
video_id = "grtmiWbmvv0" #Video ID goes here
transcript = get_youtube_transcript(video_id)

if "Error" not in transcript:
    summary = summarize_text(transcript)
    print("\nSummary:\n", summary)
else:
    print(transcript)

Device set to use mps:0
Your max_length is set to 180, but your input_length is only 9. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=4)



Summary:
  creating consistent AI characters is now easier than ever. I've just upgraded my free consistent character workflows based on your feedback making them easier to install and adding so many new cool features. I'm calling this one the consistent Character Creator and here's how it works. This workflow combines several powerful AI tools and models to create this character sheet. At its heart is a tool called multiv view adapter which uses sdxl to generate consistent 3D views of your character. I'll show you how to set everything up step by step at the end of this video. The workflow comes in three versions flux sdxl, sxl light and sXl light which is suitable for lower vram environments. It allows you to create a wide variety of styles and it's also a very good balance between speed and quality. Once you have everything set up you can just drag and drop the workflow file into the com UI interface. The workflow is designed to make it easier for players to create their own charac

It works well. Here and there there are some mistakes or repetitions but it is normal since the model is not very big. Indeed it has summarised the key points in the transcript, keeping some of the original wording, which is normal for the BART model. 

It gave the same warning messages as in the [Scraping website task](https://github.com/galiakraicheva/nlp_tasks/blob/main/websummarisation/bart_websummarisation.ipynb). However, I this time, I just shortened the max_lenght and it didn't change the summary a lot. 

After trying with more examples, the code seems to summarise relatively long transcripts well.  

**A side note:** 

A thing to improve in this script is the way you reference the video. Now I am using video ID. But where do we find our video ID? It is in the url of the video on Youtube. 

https://www.youtube.com/watch?v=dQw4w9WgXcQ

It is 11 characters long and is after ?v=. 

However, it would be very helpful to do it automatically: to paste the video url and let Python figure out by itself what the video ID is. 

#### Testing the model on short lyrics:

The model summarises well tutorials and longer transcripts. However, we should test also for very short and repetitive transcripts like song lyrics, audio-poems, etc. First, I test with: 

#### Second model: BART, chuncking and automated video ID retrival

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline
import re

def get_youtube_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = " ".join([entry['text'] for entry in transcript])
        return text
    except Exception as e:
        return f"Error: {str(e)}"

def summarize_text(text, max_words=180):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
    chunk_size = 1024  
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    
    summary = []
    for chunk in chunks:
        summarized_chunk = summarizer(chunk, max_length=max_words, min_length=50, do_sample=False)
        summary.append(summarized_chunk[0]['summary_text'])

    return " ".join(summary)

def extract_video_id(url):
    pattern = r"(?:v=|\/embed\/|\/shorts\/|\/watch\?v=|youtu\.be\/|\/v\/|\/e\/|\/\?v=|\/\?feature=player_embedded&v=|&v=|\/\#\/watch\?v=)([a-zA-Z0-9_-]{11})"
    match = re.search(pattern, url)
    return match.group(1) if match else None

# Example Usage
youtube_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
video_id = extract_video_id(youtube_url)
transcript = get_youtube_transcript(video_id)

if "Error" not in transcript:
    summary = summarize_text(transcript)
    print("\nSummary:\n", summary)
else:
    print(transcript)


Device set to use mps:0



Summary:
 "We're no strangers to love you know the rules and so do I I full commitments while I'm thinking of you wouldn't get this from any other guy" "We've known each other for so long your heart's been aching but your to sh to say it inside" "If you ask me how I'm feeling don't tell me you're too my" "I just want to tell you how I'm feeling got to make you understand" "Your heart's been aching but you're too sh to say inside" "We both know what's been going on we the game and we're going to play it"


What is obvious from the output is that it is very different from the one before. Before, it was long and there were no broken sentences. Now, it is very short and there are broken sentences. The lenght difference is normal since the first video is a tutorial, whereas the second video is a song and the lyrics are much shorter. Looking at the song lyrics I see lots of repetitions. So it is good to make the code robust to repetitions and cut sentences. To do that: 

1) Removing repetitive lines because the model may find it hard to identify what to remove and generates unnatural breaks. If I remove the repetitive lyrics, I will help the model focus on meaning.
2) Split by full sentences instead of raw characters so there won't be any words split in half and incomplete chunks.

#### Third model: BART, full sentences and removed repetitions

In [3]:
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline
import re
import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize

def get_youtube_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = " ".join([entry['text'] for entry in transcript])
        return text
    except Exception as e:
        return f"Error: {str(e)}"

def remove_redundant_lines(text):
    """ Remove duplicate lines in song lyrics to improve summarization """
    lines = text.split("\n")
    unique_lines = list(dict.fromkeys(lines))
    return " ".join(unique_lines)

def chunk_sentences(text, max_length=1024):
    """ Split text into chunks by sentences, not raw character count """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence  # Start new chunk

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def summarize_text(text, max_words=180):
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    
    text = remove_redundant_lines(text)  # Remove repeated lines
    chunks = chunk_sentences(text)  # Split by sentence

    summary = []
    for chunk in chunks:
        summarized_chunk = summarizer(chunk, max_length=max_words, min_length=50, do_sample=False)
        summary.append(summarized_chunk[0]['summary_text'])

    return " ".join(summary)

def extract_video_id(url):
    pattern = r"(?:v=|\/embed\/|\/shorts\/|\/watch\?v=|youtu\.be\/|\/v\/|\/e\/|\/\?v=|\/\?feature=player_embedded&v=|&v=|\/\#\/watch\?v=)([a-zA-Z0-9_-]{11})"
    match = re.search(pattern, url)
    return match.group(1) if match else None

# Example Usage
youtube_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
video_id = extract_video_id(youtube_url)
transcript = get_youtube_transcript(video_id)

if "Error" not in transcript:
    summary = summarize_text(transcript)
    print("\nSummary:\n", summary)
else:
    print(transcript)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Device set to use mps:0
Your max_length is set to 180, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)



Summary:
 CNN.com will feature iReporter photos in a weekly Travel Snapshots gallery. Please submit your best shots of the U.S. for next week. Visit CNN.com/Travel next Wednesday for a new gallery of snapshots. Please share your best photos of the United States with CNN iReport. "We've known each other for so long your heart's been aching but you're too sh to say inside" "We're no strangers to love you know the rules and so do I I full commitments while I'm thinking of you wouldn't get this from any other guy" "I just want to tell you how I'm feeling got to make you understand Never Going To Give You Up"


Instead of improving, the model failed miserably by including some CNN data. This is because when removing the repetitive lyrics, the input text of the model became really little and BART, trained originally on CNN data hallucinates some of its training data. It is a good idea to try another model that handles short data better (or alternatively leave the removed duplicates and just fix the text spliting)

#### Fourth model: trying t-5-base 

In [4]:
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline
import re
import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize

def get_youtube_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = " ".join([entry['text'] for entry in transcript])
        return text
    except Exception as e:
        return f"Error: {str(e)}"

def remove_redundant_lines(text):
    """ Remove duplicate lines in song lyrics to improve summarization """
    lines = text.split("\n")
    unique_lines = list(dict.fromkeys(lines))
    return " ".join(unique_lines)

def chunk_sentences(text, max_length=1024):
    """ Split text into chunks by sentences, not raw character count """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence  # Start new chunk

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def summarize_text(text, max_words=80):
    summarizer = pipeline("summarization", model="t5-base")
    
    text = remove_redundant_lines(text)  # Remove repeated lines
    chunks = chunk_sentences(text)  # Split by sentence

    summary = []
    for chunk in chunks:
        summarized_chunk = summarizer(chunk, max_length=max_words, min_length=10, do_sample=False)
        summary.append(summarized_chunk[0]['summary_text'])

    return " ".join(summary)

def extract_video_id(url):
    pattern = r"(?:v=|\/embed\/|\/shorts\/|\/watch\?v=|youtu\.be\/|\/v\/|\/e\/|\/\?v=|\/\?feature=player_embedded&v=|&v=|\/\#\/watch\?v=)([a-zA-Z0-9_-]{11})"
    match = re.search(pattern, url)
    return match.group(1) if match else None

# Example Usage
youtube_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
video_id = extract_video_id(youtube_url)
transcript = get_youtube_transcript(video_id)

if "Error" not in transcript:
    summary = summarize_text(transcript)
    print("\nSummary:\n", summary)
else:
    print(transcript)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Device set to use mps:0
Your max_length is set to 80, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)



Summary:
 . - (EN) - we've known each other for so long your heart's been aching but your to sh to say it inside we both know what's going on we know the game and we're going to play it . if you ask me how I'm feeling don't tell me you're too my you see Never Going To Give You Up never going to let you


Here I have used a different model and I have decreased the length of the summary but it is good to test also to summarise the summary to prevent cutting the summary. 

#### Fifth model: recursive summarisation

In [5]:
from youtube_transcript_api import YouTubeTranscriptApi
from transformers import pipeline
import re
import nltk
nltk.download("punkt")
from nltk.tokenize import sent_tokenize

def get_youtube_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = " ".join([entry['text'] for entry in transcript])
        return text
    except Exception as e:
        return f"Error: {str(e)}"

def remove_redundant_lines(text):
    """ Remove duplicate lines in song lyrics to improve summarization """
    lines = text.split("\n")
    unique_lines = list(dict.fromkeys(lines))
    return " ".join(unique_lines)

def chunk_sentences(text, max_length=1024):
    """ Split text into chunks by sentences, not raw character count """
    sentences = sent_tokenize(text)
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= max_length:
            current_chunk += " " + sentence
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence  # Start new chunk

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

def summarize_text(text, max_words=120):
    summarizer = pipeline("summarization", model="t5-base")
    
    text = remove_redundant_lines(text)  # Remove repeated lines
    chunks = chunk_sentences(text)  # Split by sentence

    summary = []
    for chunk in chunks:
        summarized_chunk = summarizer(chunk, max_length=max_words, min_length=10, do_sample=False)
        summary.append(summarized_chunk[0]['summary_text'])

    return " ".join(summary)

# Recursive Shortening
def recursive_summarization(text, max_words=100):
    first_summary = summarize_text(text, max_words=max_words)  
    shorter_summary = summarize_text(first_summary, max_words=max_words//2)  
    return shorter_summary

def extract_video_id(url):
    pattern = r"(?:v=|\/embed\/|\/shorts\/|\/watch\?v=|youtu\.be\/|\/v\/|\/e\/|\/\?v=|\/\?feature=player_embedded&v=|&v=|\/\#\/watch\?v=)([a-zA-Z0-9_-]{11})"
    match = re.search(pattern, url)
    return match.group(1) if match else None

# Example Usage
youtube_url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
video_id = extract_video_id(youtube_url)
transcript = get_youtube_transcript(video_id)

if "Error" not in transcript:
    recursive_summarization(transcript, max_words=120)
    print("\nSummary:\n", summary)
else:
    print(transcript)

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/macbookair/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Device set to use mps:0
Your max_length is set to 120, but your input_length is only 3. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=1)
Device set to use mps:0



Summary:
 . - (EN) - we've known each other for so long your heart's been aching but your to sh to say it inside we both know what's going on we know the game and we're going to play it . if you ask me how I'm feeling don't tell me you're too my you see Never Going To Give You Up never going to let you


#### Conclusion: 

**For longer text:** the first model seemed to work well. 

**For shorter text:** the second model performed the best so far. Removing repetitions doesn't seem urgent because they don't show up in the summary and if I do, the text becomes really short which messes up the summarisation. As further steps, it can be helpful to try stronger models on short text, like GPT-4. 

Since transcripts can be very different in topics, length, wording and vocabulary, it is a good idea, instead of looking for a universal model to summarise all transcripts, to fine-tune a model on specific types of transcripts. 