# Automatic video chaptering with LLMs and TF-IDF
 
- Accompanying Medium article: [Automatic video chaptering with LLMs and TF-IDF]()
- [GitHub repository](https://github.com/Yannael/automatic-video-chaptering)
- [Gradio demo on HuggingFace](https://huggingface.co/spaces/Yannael/video-chaptering)


# Install and import libraries

This notebook makes use of the following libraries:

- `youtube-transcript-api`: used to download the video transcript directly from YouTube
- `openai` and `groq`: used to interact with LLMs (Llama 3 8B or GPT-4o-mini)
- `gradio`: used to create a simple web interface to interact with the model

Optional libraries (for downloading the audio and transcribing it with the Whisper speech-to-text model):

- `yt-dlp`: used to download the audio of a YouTube video
- `faster_whisper`: used to transcribe the audio


Install libraries with:

In [1]:
%%capture

!pip install -q youtube-transcript-api
!pip install -q openai
!pip install -q groq
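!pip install -q gradio
# Needed by the imports below (TF-IDF matching and loading keys from a .env file)
!pip install -q scikit-learn python-dotenv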

!pip install -q yt_dlp
!pip install -q faster_whisper


Load libraries


In [1]:
import os
import json
import re 

from youtube_transcript_api import YouTubeTranscriptApi

from openai import OpenAI
from groq import Groq

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

import yt_dlp
from faster_whisper import WhisperModel
import torch

# Load your keys from .env file
from dotenv import load_dotenv
load_dotenv()
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
# Or add your keys here
# GROQ_API_KEY = "xxx"
# OPENAI_API_KEY = "xxx"

Define the YouTube video ID and the folder used to store the video, chapters, and resulting blog post.


In [2]:
# Usage example
video_id = 'ErnWZxJovaM' # MIT Introduction to Deep Learning | 6.S191 - Alexander Amini
#video_id = 'Unzc731iCUY' # How to speak - Patrick Winston - https://www.youtube.com/watch?v=Unzc731iCUY
#video_id = 'zduSFxRajkE' # Let's build the GPT Tokenizer - Andrej Karpathy

DATA_DIR = f"tmp/{video_id}"

os.makedirs(DATA_DIR, exist_ok=True)


## 1) Get the video/audio transcript



### With YouTubeTranscriptApi

In [3]:
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
transcript = [{'start': s['start'], 'text': s['text']} for s in transcript]
transcript[0:4]

[{'start': 1.17, 'text': '[Music]'},
 {'start': 10.28, 'text': 'good afternoon everyone and welcome to'},
 {'start': 12.88, 'text': 'MIT sus1 191 my name is Alexander amini'},
 {'start': 16.84, 'text': "and I'll be one of your instructors for"}]

In [4]:
len(transcript)

1789

### With Whisper



#### Download audio

In [5]:
def download_audio(video_id, DOWNLOAD_DIR="temp_download"):

    # Create the download folder (and any missing parent folders)
    os.makedirs(f"{DOWNLOAD_DIR}/{video_id}", exist_ok=True)

    audio_path = f"{DOWNLOAD_DIR}/{video_id}/{video_id}_audio.mp4"

    # Define options for yt-dlp
    ydl_opts = {
        'format': 'bestaudio',  # Select the best-quality audio-only format
        'outtmpl': audio_path
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        video_url = 'https://www.youtube.com/watch?v=' + video_id
        ydl.download([video_url])

    return audio_path

# About 5 seconds for a one hour video (65MB of audio)
#%time 
# Stored as audio_path, the name used by the transcription cell below
audio_path = download_audio(video_id, DATA_DIR)

[youtube] Extracting URL: https://www.youtube.com/watch?v=xv7oV1QD3sE
[youtube] xv7oV1QD3sE: Downloading webpage
[youtube] xv7oV1QD3sE: Downloading ios player API JSON
[youtube] xv7oV1QD3sE: Downloading android player API JSON
[youtube] xv7oV1QD3sE: Downloading player bcd1f224


         n = kGAfTKXzUyorHLTMUm ; player = https://www.youtube.com/s/player/bcd1f224/player_ias.vflset/en_US/base.js
         n = U0aww6VcNcAgDLX1WW ; player = https://www.youtube.com/s/player/bcd1f224/player_ias.vflset/en_US/base.js


[youtube] xv7oV1QD3sE: Downloading m3u8 information
[info] xv7oV1QD3sE: Downloading 1 format(s): 140
[download] Destination: tmp/xv7oV1QD3sE/xv7oV1QD3sE/xv7oV1QD3sE_audio.mp4
[download] 100% of   41.05MiB in 00:00:01 at 28.54MiB/s    
[FixupM4a] Correcting container of "tmp/xv7oV1QD3sE/xv7oV1QD3sE/xv7oV1QD3sE_audio.mp4"


#### Transcribe with Whisper

In [None]:
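device = "cuda" if torch.cuda.is_available() else "cpu"

# float16 is intended for GPU inference; fall back to int8 when running on CPU
whisper_model = WhisperModel("large-v3",
                              device=device,
                              compute_type="float16" if device == "cuda" else "int8",
                            )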


In [None]:
def speech_to_text(whisper_model, audio_file, initial_prompt="Use punctuation, like this.", language="en"):

    # transcribe() returns a lazy generator; list() runs the transcription
    segments, transcript_info = whisper_model.transcribe(audio_file, initial_prompt=initial_prompt, language=language)
    segments = list(segments)

    # Keep only the fields needed downstream
    segments = [
        {
            "start": round(s.start, 2),
            "duration": round(s.end - s.start, 2),
            "text": s.text,
        }
        for s in segments
    ]

    return segments

In [None]:
# About 14 minutes for a 1h10 video on a T4 GPU
%time transcript = speech_to_text(whisper_model, audio_path)


CPU times: user 14min 11s, sys: 14.7 s, total: 14min 25s
Wall time: 13min 48s


In [5]:
transcript[0:3]

[{'start': 1.17, 'text': '[Music]'},
 {'start': 10.28, 'text': 'good afternoon everyone and welcome to'},
 {'start': 12.88, 'text': 'MIT sus1 191 my name is Alexander amini'}]

In [6]:
with open(f"{DATA_DIR}/{video_id}_transcript.json", "w") as f:
        json.dump(transcript, f, indent=4)

## 2) Structure transcript in paragraphs

This stage improves the transcript's readability (using an LLM) by:
 
- adding punctuation
- removing verbal tics
- and adding appropriate line breaks

The added line breaks are then used to split the transcript into paragraphs.


#### Concatenate text

We start by concatenating the text of the transcript so that it can be sent to the LLM in chunks.

This removes the timestamp information, which will be added back later (with the help of TF-IDF, in stage 3 below).

In [7]:
def get_transcript_as_text(transcript):
    temp_list = [s['text'] for s in transcript]
    transcript_as_text = ' '.join(temp_list)

    return transcript_as_text

transcript_as_text = get_transcript_as_text(transcript)

print("Number of characters: "+str(len(transcript_as_text))+"\n")

#print("First 1000 characters: ")
transcript_as_text[0:1000]
#transcript_as_text

Number of characters: 66632



"[Music] good afternoon everyone and welcome to MIT sus1 191 my name is Alexander amini and I'll be one of your instructors for the course this year along with Ava and together we're really excited to welcome you to this really incredible course this is a very fast-paced and very uh intense one week that we're about to go through together right so we're going to cover the foundations of a also very fast-paced moving field and a field that has been rapidly changing over the past eight years that we have taught this course at MIT now over the past decade in fact even before we started teaching this course Ai and deep learning has really been revolutionizing so many different advances and so many different areas of science meth mathematics physics and and so on and not that long ago we were having new types of we were having challenges and problems that we did not think were necessarily solvable in our lifetimes that AI is now actually solving uh Beyond human performance today and each ye

#### Get LLM client

Groq's `llama3-8b-8192` is faster and cheaper than OpenAI's `gpt-4o-mini`, but slightly less accurate.

If using Groq, set `chunk_size_format_transcript` to 1500 for better results (with larger chunks, part of the input may be missing from the output).

In [19]:
# Uncomment below to use Groq
#llm_client_format_transcript = Groq(api_key=GROQ_API_KEY)
#llm_model_format_transcript = 'llama3-8b-8192'
#chunk_size_format_transcript = 1500

# Default: GPT-4o-mini (comment out the lines below when using Groq)
llm_client_format_transcript = OpenAI(api_key=OPENAI_API_KEY)
llm_model_format_transcript= "gpt-4o-mini-2024-07-18"
chunk_size_format_transcript = 5000
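
The chunk size is measured in characters, and the notebook slices the text at fixed character offsets, so a word can be cut in two at a chunk boundary. The sketch below shows one possible alternative that backs up to the last space inside each window. `split_into_chunks` is a hypothetical helper, not part of the notebook, and it assumes words are separated by single spaces.

```python
def split_into_chunks(text, chunk_size):
    """Split text into chunks of at most chunk_size characters, cutting at spaces when possible."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Back up to the last space in the window so that no word is cut in two
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
            # Otherwise (a single very long word), fall back to a hard cut at `end`
        chunks.append(text[start:end].strip())
        start = end
    return chunks


example_chunks = split_into_chunks("good afternoon everyone and welcome", 12)
print(example_chunks)  # ['good', 'afternoon', 'everyone', 'and welcome']
```

Joining the chunks with a single space gives back the original text, which makes it easy to check that nothing was lost.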

Price list for different LLMs, in USD per token (as of September 2024):

In [20]:
price_token={'gpt-4o': {'input': 5/1000000, 'output': 15/1000000},
             'gpt-4o-2024-08-06': {'input': 2.5/1000000, 'output': 10/1000000},
             'gpt-4o-mini-2024-07-18': {'input': 0.15/1000000, 'output': 0.6/1000000},
             'llama3-8b-8192' : {'input': 0.05 / 1000000, 'output': 0.08 / 1000000},
             'llama3-70b-8192' : {'input': 0.59 / 1000000, 'output': 0.79 / 1000000},
             'claude-3-5-sonnet-20240620': {'input': 3/1000000, 'output': 15/1000000},
             'claude-3-haiku-20240307': {'input': 0.25/1000000, 'output': 1.25/1000000},
             }

#### Call LLM

The `call_llm` function sends a prompt to the LLM and returns the response, together with the number of input and output tokens and the price of the call computed from the price list above.

In [21]:
def call_llm(client, model, system_prompt, prompt,
             temperature=0, seed=42, response_format=None, max_tokens=4000):

    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        model=model, 
        temperature=temperature,
        seed=seed,
        response_format=response_format,
        max_tokens=max_tokens
    )

    nb_input_tokens = response.usage.prompt_tokens
    nb_output_tokens = response.usage.completion_tokens
    price = nb_input_tokens * price_token[model]['input'] + nb_output_tokens * price_token[model]['output']

    print(f"input tokens: {nb_input_tokens}; output tokens: {nb_output_tokens}, price: {price}")

    response_content=response.choices[0].message.content

    return response_content, nb_input_tokens, nb_output_tokens, price


The system prompt for the transcript formatting task is as follows:

In [22]:
system_prompt_transcript_to_paragraphs = """

You are a helpful assistant.

Your task is to improve the user input's readability: add punctuation if needed and remove verbal tics, and structure the text in paragraphs separated with '\n\n'.

Keep the wording as faithful as possible to the original text. 

Put your answer within <answer></answer> tags.

"""


Test the LLM on a chunk of the transcript.

In [23]:
%%time
response_content, nb_input_tokens, nb_output_tokens, price = \
            call_llm(llm_client_format_transcript, llm_model_format_transcript,
                     system_prompt_transcript_to_paragraphs, transcript_as_text[0:chunk_size_format_transcript],
                     temperature=0, seed=42, response_format=None)

input tokens: 1007; output tokens: 905, price: 0.00069405
CPU times: user 9.66 ms, sys: 12.4 ms, total: 22.1 ms
Wall time: 8.43 s
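
The reported price can be checked by hand from the token counts printed above and the `gpt-4o-mini-2024-07-18` entry of the price list. The snippet below only repeats that arithmetic; the two price variables are copied from the price list for illustration.

```python
# Prices in USD per token for gpt-4o-mini-2024-07-18, copied from the price list
input_price = 0.15 / 1_000_000
output_price = 0.6 / 1_000_000

# Token counts reported by the test call
nb_input_tokens, nb_output_tokens = 1007, 905

price = nb_input_tokens * input_price + nb_output_tokens * output_price
print(f"{price:.8f}")  # 0.00069405
```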


In [24]:
print(response_content)


<answer>[Music] Good afternoon, everyone, and welcome to MIT's 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course.

This is a very fast-paced and intense one week that we're about to go through together. We will cover the foundations of a rapidly evolving field that has been changing significantly over the past eight years since we began teaching this course at MIT. In fact, over the past decade, even before we started teaching this course, AI and deep learning have been revolutionizing many different areas of science, mathematics, physics, and more.

Not long ago, we faced challenges and problems that we did not think were necessarily solvable in our lifetimes. However, AI is now solving these problems, often exceeding human performance. Each year that we teach this course, this lecture in particular is becoming harder to teach. For an introductory lev

#### Process the whole transcript

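Split the transcript into chunks and process them one at a time. The last paragraph returned for a chunk is prepended to the next chunk, so that a paragraph cut at a chunk boundary can be completed.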

In [15]:
def transcript_to_paragraphs(transcript, llm_client, llm_model, chunk_size=5000, progress=None):

    transcript_as_text = ' '.join([s['text'] for s in transcript])

    paragraphs = []
    last_paragraph = ""

    total_nb_input_tokens, total_nb_output_tokens, total_price = 0, 0, 0
    
    nb_chunks = int(len(transcript_as_text) / chunk_size) + 1
    progress_i = 0
    print(f"Number of chunks: {nb_chunks}")

    #for i in range(0, 10000, chunk_size): 
    for i in range(0, len(transcript_as_text), chunk_size):
        
        print ("i is: "+str(i))
        
        chunk = last_paragraph + " " + transcript_as_text[i:i + chunk_size]
        
        if progress is not None:
            progress_i += 1
            progress(progress_i/nb_chunks, desc="Processing")
        
        found_edited_transcript = False
    
        while not found_edited_transcript:

            response_content, nb_input_tokens, nb_output_tokens, price = \
                call_llm(llm_client, llm_model, 
                     system_prompt = system_prompt_transcript_to_paragraphs, prompt = chunk,
                     temperature=0.2, seed=42, response_format=None)
        
            # Sometimes the model 'forgets' to close the <answer> tag
            if "</answer>" not in response_content:
                response_content += "</answer>"
                
            # Extract content from <answer> tags
            pattern = re.compile(r'<answer>(.*?)</answer>', re.DOTALL)
            response_content_edited = pattern.findall(response_content)
            
            if len(response_content_edited) > 0:
                found_edited_transcript = True
                response_content_edited = response_content_edited[0]
            
            else:
                print("No edited transcript found. Trying again.")
                print(response_content[0:100])
                print(response_content[-100:])
                

        total_nb_input_tokens += nb_input_tokens
        total_nb_output_tokens += nb_output_tokens
        total_price += price
    
        paragraphs_chunk = response_content_edited.strip().split('\n\n')

        print('Found paragraphs:', len(paragraphs_chunk))
        last_paragraph = paragraphs_chunk[-1]

        paragraphs += paragraphs_chunk[:-1]

    paragraphs += [last_paragraph]

    paragraphs_dict = [{'paragraph_number': i, 'paragraph_text': paragraph} for i, paragraph in enumerate(paragraphs)]

    return paragraphs_dict, total_nb_input_tokens, total_nb_output_tokens, total_price
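
The tag-extraction step inside the loop can be tried in isolation, without calling an LLM. The sketch below mirrors that step under the same assumptions (the answer is wrapped in `<answer>` tags and the closing tag may be missing). `extract_answer` and `split_paragraphs` are hypothetical helper names, not functions from the notebook.

```python
import re

ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response_content):
    """Return the text inside the first <answer>...</answer> pair, or None if there is no opening tag."""
    # Tolerate a missing closing tag, as the loop above does
    if "</answer>" not in response_content:
        response_content += "</answer>"
    match = ANSWER_PATTERN.search(response_content)
    return match.group(1).strip() if match else None


def split_paragraphs(answer_text):
    """Split the answer on blank lines, dropping empty paragraphs."""
    return [p for p in answer_text.split("\n\n") if p.strip()]


answer = extract_answer("<answer>First paragraph.\n\nSecond paragraph.")
print(split_paragraphs(answer))  # ['First paragraph.', 'Second paragraph.']
```

Returning `None` when no opening tag is found lets the caller decide whether to retry, which is what the `while` loop above does.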


In [17]:
%%time
paragraphs, nb_input_tokens, nb_output_tokens, price = transcript_to_paragraphs(transcript, llm_client_format_transcript, llm_model_format_transcript, chunk_size=chunk_size_format_transcript)

Number of chunks: 45
i is: 0
input tokens: 359; output tokens: 284, price: 4.067e-05
Found paragraphs: 4
i is: 1500
input tokens: 373; output tokens: 337, price: 4.5610000000000005e-05
Found paragraphs: 2
i is: 3000
input tokens: 547; output tokens: 497, price: 6.711e-05
Found paragraphs: 4
i is: 4500
input tokens: 490; output tokens: 423, price: 5.834e-05
Found paragraphs: 4
i is: 6000
input tokens: 351; output tokens: 230, price: 3.5950000000000006e-05
Found paragraphs: 2
i is: 7500
input tokens: 455; output tokens: 337, price: 4.971e-05
Found paragraphs: 4
i is: 9000
input tokens: 414; output tokens: 350, price: 4.8700000000000005e-05
Found paragraphs: 3
i is: 10500
input tokens: 553; output tokens: 499, price: 6.757e-05
Found paragraphs: 4
i is: 12000
input tokens: 525; output tokens: 483, price: 6.489000000000001e-05
Found paragraphs: 3
i is: 13500
input tokens: 460; output tokens: 404, price: 5.5320000000000006e-05
Found paragraphs: 5
i is: 15000
input tokens: 371; output tokens:

InternalServerError: Error code: 503 - {'error': {'message': 'Service Unavailable', 'type': 'internal_server_error'}}

In [146]:
len(paragraphs)

352

Example of the first paragraphs

In [147]:
paragraphs[0:3]

[{'paragraph_number': 0, 'paragraph_text': 'Hi everyone. '},
 {'paragraph_number': 1,
  'paragraph_text': "In this video, I'd like us to cover the process of tokenization in large language models. Now, you see here that I have a set face, and that's because tokenization is my least favorite part of working with large language models. Unfortunately, it is necessary to understand in some detail because it is fairly hairy and gnarly. There are a lot of hidden foot guns to be aware of, and much of the oddness with large language models typically traces back to tokenization. "},
 {'paragraph_number': 2,
  'paragraph_text': 'So, what is tokenization? In my previous video, "Let\'s Build GPT from Scratch," we actually already did tokenization, but we did a very naive, simple version of it. When you go to the Google Colab for that video, you will see that we loaded our training set, which was the Shakespeare dataset. In the beginning, the Shakespeare dataset is just a large string in Python; it

Save the paragraphs in a JSON file

In [148]:
with open(f"{DATA_DIR}/{video_id}_paragraphs.json", "w") as f:
        json.dump(paragraphs, f, indent=4)

## 3) Infer paragraph timestamps 

Let us now add back the timestamps to the paragraphs, using TF-IDF to match the paragraphs to the transcript segments.

The `transform_text_segments` function takes a list of text segments and returns one text window per segment: each window starts at its segment and is extended with words from the following segments until it reaches a specified number of words.

For example, given the five following text segments from the transcript:

```
transcript[0:5]
[{'start': 1.17, 'text': '[Music]'},
 {'start': 10.28, 'text': 'good afternoon everyone and welcome to'},
 {'start': 12.88, 'text': 'MIT sus1 191 my name is Alexander amini'},
 {'start': 16.84, 'text': "and I'll be one of your instructors for"},
 {'start': 18.32, 'text': 'the course this year along with Ava and'}]
```

Calling the `transform_text_segments` function with `num_words=10` would turn these segments into the following windows:

```
['[Music] good afternoon everyone and welcome to MIT sus1 191',
 'good afternoon everyone and welcome to MIT sus1 191 my',
 "MIT sus1 191 my name is Alexander amini and I'll",
 "and I'll be one of your instructors for the course",
 'the course this year along with Ava and']
```

Each window contains at most 10 words. Windows near the end of the list may be shorter, because there are not enough following words left to fill them. This will be useful for the next step, where we will match the paragraphs to the transcript segments using TF-IDF.

In [149]:
def transform_text_segments(text_segments, num_words=50):
    # Initialize variables
    transformed_segments = []
    current_index = 0
    num_segments = len(text_segments)

    for i in range(num_segments):

        current_index = i

        # Get the current segment's starting timestamp and text
        current_segment = text_segments[current_index]
        current_text = current_segment['text']

        # Start the combined text with the current segment (truncated to num_words)
        combined_text = " ".join(current_text.split()[:num_words])
        number_words_collected = len(current_text.split())

        # Collect words from subsequent segments
        while number_words_collected < num_words and (current_index + 1) < num_segments:
            current_index += 1
            next_segment = text_segments[current_index]
            next_text = next_segment['text']
            next_words = next_text.split()

            # Append words from the next segment
            if number_words_collected + len(next_words) <= num_words:
                combined_text += ' ' + next_text
                number_words_collected += len(next_words)
            else:
                # Only append enough words to reach the num_words limit
                words_needed = num_words - number_words_collected
                combined_text += ' ' + ' '.join(next_words[:words_needed])
                number_words_collected = num_words

        # Append the combined segment to the result
        transformed_segments.append(combined_text)

    return transformed_segments



In [150]:
transcript[0:5]

[{'start': 0.04, 'text': "hi everyone so in this video I'd like us"},
 {'start': 2.04, 'text': 'to cover the process of tokenization in'},
 {'start': 4.08, 'text': 'large language models now you see here'},
 {'start': 6.44, 'text': "that I have a set face and that's"},
 {'start': 8.28, 'text': 'because uh tokenization is my least'}]

In [151]:
transform_text_segments(transcript[0:5], num_words=10)

["hi everyone so in this video I'd like us to",
 'to cover the process of tokenization in large language models',
 'large language models now you see here that I have',
 "that I have a set face and that's because uh",
 'because uh tokenization is my least']

The `add_timestamps_to_paragraphs` function takes the transcript and the paragraphs and adds the timestamps back to the paragraphs. For each paragraph, it uses TF-IDF to find the most similar transcript segment.

In [152]:
def add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50):
    list_indices = []
    
    transcript_num_words = transform_text_segments(transcript, num_words=num_words)

    paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']} for p in paragraphs]
    paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)
    
    # Fit TF-IDF on the transcript windows and the paragraph starts together
    tfidf_matrix = TfidfVectorizer().fit_transform(transcript_num_words + paragraphs_num_words)
    # Convert the sparse TF-IDF matrix to a dense array (one row per text)
    vectors = tfidf_matrix.toarray()
    
    for i in range(len(paragraphs_num_words)):
        
        # Extract the TF-IDF vector for the paragraph
        paragraph_vector = vectors[len(transcript_num_words) + i]

        # Calculate the cosine similarity between the paragraph vector and each transcript chunk
        similarities = cosine_similarity(vectors[:len(transcript_num_words)], paragraph_vector.reshape(1, -1))
        # Find the index of the most similar chunk
        best_match_index = int(np.argmax(similarities))

        list_indices.append(best_match_index)

        paragraphs[i]['matched_index'] = best_match_index
        paragraphs[i]['matched_text'] = transcript[best_match_index]['text']
        paragraphs[i]['start_time'] = int(transcript[best_match_index]['start'])-2
        if paragraphs[i]['start_time'] < 0:
            paragraphs[i]['start_time'] = 0


    return paragraphs
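
To make the matching idea concrete, here is a dependency-free toy version. It is a simplified stand-in, not the notebook's implementation, which relies on scikit-learn's `TfidfVectorizer` and `cosine_similarity`. The helper names (`tokenize`, `tfidf_vectors`, `match_paragraphs`) and the toy texts are assumptions made for illustration; the IDF formula used is the smoothed variant `ln((1 + n) / (1 + df)) + 1`.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep words of two or more characters
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    # Both vectors are unit length, so the dot product is the cosine similarity
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def match_paragraphs(windows, paragraph_starts):
    """For each paragraph start, return the index of the most similar transcript window."""
    vectors = tfidf_vectors(windows + paragraph_starts)
    window_vecs = vectors[:len(windows)]
    matches = []
    for p_vec in vectors[len(windows):]:
        sims = [cosine(p_vec, w_vec) for w_vec in window_vecs]
        matches.append(max(range(len(sims)), key=sims.__getitem__))
    return matches


toy_windows = [
    "hi everyone so in this video",
    "to cover the process of tokenization",
    "large language models now you see",
]
toy_paragraphs = ["Hi everyone.", "Large language models.", "The process of tokenization."]

print(match_paragraphs(toy_windows, toy_paragraphs))  # [0, 2, 1]
```

In this toy example each paragraph shares words with exactly one window, so the best match is unambiguous. On a real transcript, the TF-IDF weighting is what lets rare words count for more than common ones when several windows overlap with a paragraph.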

In [153]:
%%time
paragraphs = add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50)

CPU times: user 34.1 s, sys: 29.9 s, total: 1min 4s
Wall time: 9.6 s


Example of the first paragraphs where the timestamps have been added back (start_time, in seconds):

In [154]:
paragraphs[0:5]

[{'paragraph_number': 0,
  'paragraph_text': 'Hi everyone. ',
  'matched_index': 0,
  'matched_text': "hi everyone so in this video I'd like us",
  'start_time': 0},
 {'paragraph_number': 1,
  'paragraph_text': "In this video, I'd like us to cover the process of tokenization in large language models. Now, you see here that I have a set face, and that's because tokenization is my least favorite part of working with large language models. Unfortunately, it is necessary to understand in some detail because it is fairly hairy and gnarly. There are a lot of hidden foot guns to be aware of, and much of the oddness with large language models typically traces back to tokenization. ",
  'matched_index': 1,
  'matched_text': 'to cover the process of tokenization in',
  'start_time': 0},
 {'paragraph_number': 2,
  'paragraph_text': 'So, what is tokenization? In my previous video, "Let\'s Build GPT from Scratch," we actually already did tokenization, but we did a very naive, simple version of it. 

Save the paragraphs with timestamps in a JSON file

In [155]:
with open(f"{DATA_DIR}/{video_id}_paragraphs.json", "w") as f:
        json.dump(paragraphs, f, indent=4)
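
The `start_time` values are stored as integers, in seconds. When chapters are displayed to a reader, a clock-style format is usually easier to scan. A small sketch follows; `format_timestamp` is a hypothetical helper, not part of the notebook.

```python
def format_timestamp(seconds):
    """Convert a number of seconds into an H:MM:SS string."""
    seconds = max(0, int(seconds))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


print(format_timestamp(0))     # 0:00:00
print(format_timestamp(3725))  # 1:02:05
```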

## 4) Generate table of contents

The table of contents is obtained by grouping consecutive paragraphs into chapters and identifying meaningful chapter titles.


In [156]:
paragraphs_number_text = [{'paragraph_number': p['paragraph_number'], 'paragraph_text': p['paragraph_text']} for p in paragraphs]
paragraphs_json_dump = json.dumps(paragraphs_number_text)

paragraphs_json_dump[0:1000]

'[{"paragraph_number": 0, "paragraph_text": "Hi everyone. "}, {"paragraph_number": 1, "paragraph_text": "In this video, I\'d like us to cover the process of tokenization in large language models. Now, you see here that I have a set face, and that\'s because tokenization is my least favorite part of working with large language models. Unfortunately, it is necessary to understand in some detail because it is fairly hairy and gnarly. There are a lot of hidden foot guns to be aware of, and much of the oddness with large language models typically traces back to tokenization. "}, {"paragraph_number": 2, "paragraph_text": "So, what is tokenization? In my previous video, \\"Let\'s Build GPT from Scratch,\\" we actually already did tokenization, but we did a very naive, simple version of it. When you go to the Google Colab for that video, you will see that we loaded our training set, which was the Shakespeare dataset. In the beginning, the Shakespeare dataset is just a large string in Python; i

In [157]:
system_prompt_paragraphs_to_toc = """

	You are a helpful assistant.

	You are given a transcript of a course in JSON format as a list of paragraphs, each containing 'paragraph_number' and 'paragraph_text' keys.

	Your task is to group consecutive paragraphs in chapters for the course and identify meaningful chapter titles.

	Here are the steps to follow:

1. Read the transcript carefully to understand its general structure and the main topics covered.
2. Look for clues that a new chapter is about to start. This could be a change of topic, a change of time or setting, the introduction of new themes or topics, or the speaker's explicit mention of a new part.
3. For each chapter, keep track of the paragraph number that starts the chapter and identify a meaningful chapter title.
4. Chapters should ideally be equally spaced throughout the transcript, and discuss a specific topic.

	Format your result in JSON, with a list of dictionaries for chapters, with 'start_paragraph_number':integer and 'title':string as key:value.
	
	Example: 
    {"chapters": 
        [{"start_paragraph_number": 0, "title": "Introduction"}, 
         {"start_paragraph_number": 10, "title": "Chapter 1"}
        ]
    }

"""

We use GPT-4o-mini for this task, as it is more cost-effective than OpenAI's GPT-4o and generally provides good results for this specific task.

In [158]:
llm_client_get_toc = OpenAI(api_key=OPENAI_API_KEY)
llm_model_get_toc= "gpt-4o-mini-2024-07-18"
#llm_model_get_toc= "gpt-4o-2024-08-06"
chunk_size_toc = 30

In [159]:
paragraphs_number_text = [{'paragraph_number': p['paragraph_number'], 'paragraph_text': p['paragraph_text']} for p in paragraphs]
chunk_json_dump = json.dumps(paragraphs_number_text[0:chunk_size_toc])

response, _, _, _ = call_llm(llm_client_get_toc, llm_model_get_toc, \
                    system_prompt_paragraphs_to_toc, chunk_json_dump, \
                    temperature=0, seed=42, response_format={"type": "json_object"})


input tokens: 3145; output tokens: 140, price: 0.00055575


In [160]:
print(response)

{
  "chapters": [
    {
      "start_paragraph_number": 0,
      "title": "Introduction to Tokenization"
    },
    {
      "start_paragraph_number": 5,
      "title": "Byte Pair Encoding and Advanced Tokenization"
    },
    {
      "start_paragraph_number": 10,
      "title": "Challenges and Complexities of Tokenization"
    },
    {
      "start_paragraph_number": 21,
      "title": "Tokenization in Non-English Languages"
    },
    {
      "start_paragraph_number": 26,
      "title": "Tokenization Issues in Programming Languages"
    }
  ]
}


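We generate the TOC sequentially on chunks of paragraphs, as this generally gives better results than processing all paragraphs at once. Each new chunk starts at the last chapter found in the previous chunk.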

In [161]:
def paragraphs_to_toc(paragraphs, llm_client, llm_model, chunk_size=100):

    chapters = []
    number_last_chapter = 0

    total_nb_input_tokens, total_nb_output_tokens, total_price = 0, 0, 0

    while number_last_chapter < len(paragraphs):

        print(number_last_chapter)

        chunk = paragraphs[number_last_chapter:(number_last_chapter + chunk_size)]
        chunk = [{'paragraph_number': p['paragraph_number'], 'paragraph_text': p['paragraph_text']} for p in chunk]

        chunk_json_dump = json.dumps(chunk)

        content, nb_input_tokens, nb_output_tokens, price = call_llm(\
                llm_client, llm_model, \
                system_prompt_paragraphs_to_toc, chunk_json_dump, \
                temperature=0, seed=42, response_format={"type": "json_object"})

        total_nb_input_tokens += nb_input_tokens
        total_nb_output_tokens += nb_output_tokens
        
        chapters_chunk = json.loads(content)['chapters']
        
        if number_last_chapter == chapters_chunk[-1]['start_paragraph_number']:
            break

        chapters += chapters_chunk[:-1]
        
        number_last_chapter = chapters_chunk[-1]['start_paragraph_number']
        if number_last_chapter >= len(paragraphs)-5:
            break
        
    total_price = (total_nb_input_tokens * price_token[llm_model]['input'] + 
                   total_nb_output_tokens * price_token[llm_model]['output'])
    
    chapters += [chapters_chunk[-1]]

    return chapters, total_nb_input_tokens, total_nb_output_tokens, total_price
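
The chunk-by-chunk stitching can be illustrated without calling an LLM. The sketch below is a simplified variant of the loop above, not the notebook's exact logic (it omits the heuristic that stops a few paragraphs before the end). `stitch_chapters` and the stub `chapter_every_four` are hypothetical names; the stub stands in for the LLM call and simply declares a chapter at every fourth paragraph.

```python
def chapter_every_four(start, end):
    """Stub standing in for the LLM: a chapter starts at every multiple of 4."""
    return [
        {"start_paragraph_number": n, "title": f"Chapter starting at {n}"}
        for n in range(start, end)
        if n % 4 == 0
    ]


def stitch_chapters(num_paragraphs, chunk_size, get_chunk_chapters):
    """Build a table of contents chunk by chunk, restarting each chunk at the last chapter found."""
    chapters = []
    start = 0
    while start < num_paragraphs:
        end = min(start + chunk_size, num_paragraphs)
        chunk_chapters = get_chunk_chapters(start, end)
        last_start = chunk_chapters[-1]["start_paragraph_number"]
        # Stop at the final chunk, or when the chunk yields no new chapter boundary
        if end == num_paragraphs or last_start == start:
            chapters += chunk_chapters
            break
        # Keep all chapters but the last one, which is re-examined in the next chunk
        chapters += chunk_chapters[:-1]
        start = last_start
    return chapters


toc = stitch_chapters(num_paragraphs=20, chunk_size=10, get_chunk_chapters=chapter_every_four)
chapter_starts = [c["start_paragraph_number"] for c in toc]
print(chapter_starts)  # [0, 4, 8, 12, 16]
```

Re-examining the last chapter of each chunk gives the model more context on where that chapter really begins and what follows it, at the cost of sending some paragraphs twice.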

In [162]:
table_of_content, total_nb_input_tokens, total_nb_output_tokens, total_price = \
    paragraphs_to_toc(paragraphs, llm_client_get_toc, llm_model_get_toc, chunk_size=chunk_size_toc)

0
input tokens: 3145; output tokens: 140, price: 0.00055575
26
input tokens: 2650; output tokens: 96, price: 0.00045509999999999995
46
input tokens: 3024; output tokens: 84, price: 0.000504
68
input tokens: 2807; output tokens: 92, price: 0.00047624999999999995
90
input tokens: 2812; output tokens: 84, price: 0.0004722
114
input tokens: 2992; output tokens: 60, price: 0.00048479999999999997
125
input tokens: 2454; output tokens: 84, price: 0.0004185
153
input tokens: 2533; output tokens: 114, price: 0.00044835
180
input tokens: 3008; output tokens: 86, price: 0.0005028
206
input tokens: 3156; output tokens: 89, price: 0.0005268
230
input tokens: 3092; output tokens: 139, price: 0.0005472
250
input tokens: 2940; output tokens: 64, price: 0.0004794
274
input tokens: 3206; output tokens: 81, price: 0.0005295
295
input tokens: 3234; output tokens: 155, price: 0.0005781
323
input tokens: 2646; output tokens: 140, price: 0.0004809


In [163]:
table_of_content

[{'start_paragraph_number': 0, 'title': 'Introduction to Tokenization'},
 {'start_paragraph_number': 5,
  'title': 'Byte Pair Encoding and Advanced Tokenization'},
 {'start_paragraph_number': 10,
  'title': 'Challenges and Complexities of Tokenization'},
 {'start_paragraph_number': 21,
  'title': 'Tokenization in Non-English Languages'},
 {'start_paragraph_number': 26,
  'title': 'Tokenization and Efficiency in Python'},
 {'start_paragraph_number': 37,
  'title': 'Understanding Unicode and Code Points'},
 {'start_paragraph_number': 46, 'title': 'Understanding Unicode Encodings'},
 {'start_paragraph_number': 60, 'title': 'Introduction to Byte Pair Encoding'},
 {'start_paragraph_number': 68,
  'title': 'Introduction to Tokenization and Encoding'},
 {'start_paragraph_number': 86, 'title': 'Iterative Merging of Byte Pairs'},
 {'start_paragraph_number': 90, 'title': 'Introduction to Byte Pair Encoding'},
 {'start_paragraph_number': 102, 'title': 'Training the Tokenizer'},
 {'start_paragraph

In [164]:
with open(f"{DATA_DIR}/{video_id}_toc.json", "w") as f:
    json.dump(table_of_content, f, indent=4)
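
The `get_chapters` function in the next section slices paragraphs between consecutive `start_paragraph_number` values, so it implicitly assumes that the table of contents starts at paragraph 0, is strictly increasing, and stays within range. A minimal sanity check along those lines is sketched here; `check_toc` is a hypothetical helper, not part of the notebook or `utils.py`.

```python
def check_toc(table_of_content, nb_paragraphs):
    """Return a list of human-readable problems; an empty list means the TOC looks consistent."""
    if not table_of_content:
        return ["table of contents is empty"]

    problems = []
    starts = [entry["start_paragraph_number"] for entry in table_of_content]

    # Paragraphs before the first chapter would be dropped silently
    if starts[0] != 0:
        problems.append(f"first chapter starts at paragraph {starts[0]}, not 0")

    # Each chapter must start strictly after the previous one
    for previous, current in zip(starts, starts[1:]):
        if current <= previous:
            problems.append(f"start {current} does not come after {previous}")

    # Every start must index an existing paragraph
    if max(starts) >= nb_paragraphs:
        problems.append(f"start {max(starts)} is beyond the last paragraph ({nb_paragraphs - 1})")

    return problems

toy_toc = [
    {"start_paragraph_number": 0, "title": "Introduction"},
    {"start_paragraph_number": 5, "title": "Main topic"},
    {"start_paragraph_number": 5, "title": "Duplicate boundary"},
]
print(check_toc(toy_toc, nb_paragraphs=10))
```

Returning a list of problems rather than raising lets the caller decide whether to retry the LLM call, repair the entries, or stop.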

## 5) Output structured chapters

This last stage combines the paragraphs and the table of contents into a structured JSON file with one entry per chapter.

In [165]:
def get_chapters(paragraphs, table_of_content):
    """Combine the paragraphs and the table of content into a list of chapter dicts."""

    chapters = []

    for i, toc_entry in enumerate(table_of_content):

        start_paragraph_number = toc_entry['start_paragraph_number']

        if i < len(table_of_content) - 1:
            # A chapter ends where the next one starts (end index is exclusive)
            end_paragraph_number = table_of_content[i + 1]['start_paragraph_number']
            end_time = paragraphs[end_paragraph_number]['start_time']
        else:
            # Last chapter: runs to the final paragraph. Only the start time of
            # that paragraph is known, so it is used as the chapter end time
            end_paragraph_number = len(paragraphs)
            end_time = paragraphs[-1]['start_time']

        chapter_paragraphs = paragraphs[start_paragraph_number:end_paragraph_number]

        chapter = {'num_chapter': i,
                   'title': toc_entry['title'],
                   'start_paragraph_number': start_paragraph_number,
                   'end_paragraph_number': end_paragraph_number,
                   'start_time': paragraphs[start_paragraph_number]['start_time'],
                   'end_time': end_time,
                   'paragraphs': [p['paragraph_text'] for p in chapter_paragraphs],
                   'paragraph_timestamps': [p['start_time'] for p in chapter_paragraphs],
                  }

        chapters.append(chapter)

    return chapters
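
To make the boundary logic above concrete: each chapter covers the half-open paragraph range from its own `start_paragraph_number` up to, but not including, the next chapter's start, and the last chapter runs to the end of the transcript. The sketch below reproduces only that range computation on toy data; `chapter_ranges` is a hypothetical name, not a function from the notebook.

```python
def chapter_ranges(table_of_content, nb_paragraphs):
    """Return (title, start, end) triples, where end is exclusive."""
    starts = [entry["start_paragraph_number"] for entry in table_of_content]
    # Each chapter ends where the next one starts; the last ends after the final paragraph
    ends = starts[1:] + [nb_paragraphs]
    return [(entry["title"], start, end)
            for entry, start, end in zip(table_of_content, starts, ends)]

toy_toc = [
    {"start_paragraph_number": 0, "title": "Introduction"},
    {"start_paragraph_number": 2, "title": "Main topic"},
    {"start_paragraph_number": 5, "title": "Conclusion"},
]

for title, start, end in chapter_ranges(toy_toc, nb_paragraphs=6):
    print(f"{title}: paragraphs {list(range(start, end))}")
```

With six paragraphs, the three toy chapters cover paragraphs 0 to 1, 2 to 4, and 5, so every paragraph belongs to exactly one chapter.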

In [166]:
chapters = get_chapters(paragraphs, table_of_content)


In [167]:
with open(f"{DATA_DIR}/{video_id}.json", "w") as f:
    json.dump(chapters, f, indent=4)

In [168]:
def convert_seconds_to_hms(seconds):
    # Drop fractions of a second so that float timestamps also format cleanly
    seconds = int(seconds)

    # Calculate hours, minutes, and remaining seconds
    hours = seconds // 3600
    minutes = (seconds % 3600) // 60
    remaining_seconds = seconds % 60

    # Format the result as HH:MM:SS
    return f"{hours:02}:{minutes:02}:{remaining_seconds:02}"
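
As a worked example of this arithmetic, a timestamp of 5125 seconds is 1 hour (3600 s), 25 minutes (1500 s), and 25 seconds, which formats as `01:25:25`. The sketch below does the same conversion with `divmod`; `seconds_to_hms` is a hypothetical stand-in used for illustration, not a replacement mandated by the notebook.

```python
def seconds_to_hms(seconds):
    """Format a duration in seconds as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)   # whole hours, leftover seconds
    minutes, secs = divmod(remainder, 60)    # whole minutes, leftover seconds
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(seconds_to_hms(5125))    # 1 h + 25 min + 25 s
print(seconds_to_hms(139.7))   # float input is truncated to 139 s
```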

In [169]:
for chapter in chapters:
    print(convert_seconds_to_hms(chapter['start_time'])+" : "+chapter['title'])

00:00:00 : Introduction to Tokenization
00:02:19 : Byte Pair Encoding and Advanced Tokenization
00:04:16 : Challenges and Complexities of Tokenization
00:09:30 : Tokenization in Non-English Languages
00:11:21 : Tokenization and Efficiency in Python
00:14:59 : Understanding Unicode and Code Points
00:18:11 : Understanding Unicode Encodings
00:23:40 : Introduction to Byte Pair Encoding
00:26:46 : Introduction to Tokenization and Encoding
00:34:09 : Iterative Merging of Byte Pairs
00:34:56 : Introduction to Byte Pair Encoding
00:39:21 : Training the Tokenizer
00:42:37 : Encoding and Decoding Overview
00:48:21 : Implementing Token Encoding
00:55:09 : Handling Special Cases in Encoding
00:56:50 : Introduction to Byte Pair Encoding
00:57:35 : Exploring GPT-2 Tokenization
00:59:15 : Regex Patterns in Tokenization
01:04:30 : Tokenization and Regex Patterns
01:11:36 : Introduction to TikToken Library
01:14:58 : Understanding the GPT-2 Encoder
01:18:24 : Special Tokens in Tokenization
01:25:25 :

### Chapters to Markdown

Let us convert the JSON chapters to Markdown format.

In [94]:
def chapters_to_markdown(chapters):

    markdown = ""

    for chapter in chapters:

        # One top-level heading per chapter
        markdown += f"# {chapter['title']}\n\n"

        # Prefix each paragraph with its HH:MM:SS start time
        for paragraph, start_time in zip(chapter['paragraphs'], chapter['paragraph_timestamps']):

            timestamp = convert_seconds_to_hms(int(start_time))

            markdown += f"{timestamp} - {paragraph}\n\n"

    return markdown


In [95]:
markdown = chapters_to_markdown(chapters)

In [96]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

In [97]:
printmd(markdown[0:1000])

# Introduction to Deep Learning

00:00:00 - [Music] Good afternoon, everyone, and welcome to MIT's 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course.

00:00:27 - This is a very fast-paced and intense one-week experience that we're about to go through together. We will cover the foundations of a rapidly evolving field that has been changing significantly over the past eight years that we have taught this course at MIT. In fact, over the past decade, even before we started teaching this course, AI and deep learning have been revolutionizing many different areas of science, mathematics, physics, and more.

00:01:05 - Not long ago, we faced challenges and problems that we did not think were necessarily solvable in our lifetimes. Yet, AI is now solving these problems, often surpassing human performance. Each year that we teach this course, this particular l

### Gradio app

Let us bundle all the stages into a Gradio app.

In [88]:
import gradio as gr
import os
import json

from youtube_transcript_api import YouTubeTranscriptApi

import utils

from openai import OpenAI
from groq import Groq

from dotenv import load_dotenv
load_dotenv()
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

import importlib
importlib.reload(utils)


<module 'utils' from '/Users/yalb/Projects/Github/video-chaptering-github/utils.py'>

In [89]:
#llm_client_format_transcript = Groq(api_key=GROQ_API_KEY)
#llm_model_format_transcript = 'llama3-8b-8192'
#chunk_size_format_transcript = 2000

#llm_client_get_toc = OpenAI(api_key=OPENAI_API_KEY)
#llm_model_get_toc= "gpt-4o-mini-2024-07-18"
#chunk_size_toc = 30

In [90]:
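def get_llm_client_and_model(llm_model):
    # Map the short model name chosen in the UI to an API client and a full model ID

    if llm_model == "llama3-8b":
        llm_client = Groq(api_key=GROQ_API_KEY)
        llm_model = 'llama3-8b-8192'

    elif llm_model == "gpt-4o-mini":
        llm_client = OpenAI(api_key=OPENAI_API_KEY)
        llm_model = 'gpt-4o-mini-2024-07-18'

    else:
        # Fail early instead of hitting an UnboundLocalError on the return below
        raise ValueError(f"Unknown LLM model: {llm_model}")

    return llm_client, llm_model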
        

In [91]:
def gradio_process_video(video_id, 
                         model_format_transcript, model_toc,
                         chunk_size_format_transcript, chunk_size_toc, 
                         progress=gr.Progress()):
    
    if video_id in ["ErnWZxJovaM"]:
        chapters = utils.load_json_chapters(video_id)
        
    else:
        
        transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
    
        chunk_size_format_transcript = int(chunk_size_format_transcript)
    
        llm_client_format_transcript, llm_model_format_transcript = \
            get_llm_client_and_model(model_format_transcript)
    
        paragraphs, nb_input_tokens, nb_output_tokens, price = \
            utils.transcript_to_paragraphs(transcript,\
                                 llm_client_format_transcript, llm_model_format_transcript,\
                                 chunk_size=chunk_size_format_transcript, progress=progress)

        paragraphs = utils.add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50)
    
        chunk_size_toc = int(chunk_size_toc)
        
        llm_client_get_toc, llm_model_get_toc = \
            get_llm_client_and_model(model_toc)
        
        json_toc, nb_input_tokens, nb_output_tokens, price = \
        utils.paragraphs_to_toc(paragraphs, \
                        llm_client_get_toc, llm_model_get_toc, \
                        chunk_size=chunk_size_toc)
    
        chapters = utils.get_chapters(paragraphs, json_toc)
    
    output_html = utils.get_result_as_html(chapters, video_id)
    
    return {output_processing: str(output_html),
            gv_output: output_html}
    

In [92]:
css = """
.content {
    padding: 20px;
    max-width: 800px;
    margin: 0 auto;
    background-color: #ffffff;
    box-shadow: 0 0 10px rgba(0, 0, 0, 0.1);
    border-radius: 8px;
}
"""

example_video_id = "ErnWZxJovaM"
example_chapters = utils.load_json_chapters(example_video_id)
example_output_html = utils.get_result_as_html(example_chapters, example_video_id)

with gr.Blocks(css=css) as app:
    
    
    gr.HTML("<div align='center'><h1 class='header'>Demo: Automatic video chaptering with LLMs and TF-IDF</h1></div>")
    gr.HTML("<div align='center'><h3 class='header'>From raw transcript to structured document - Check the <a href=''>Medium article</a> for more details</h3></div>")
    gr.HTML("<hr>")
    gr.Markdown("""This demo relies on:
                - Groq's Llama 3 8B for transcript preprocessing
                - OpenAI's GPT-4o-mini for chaptering. Note: Using GPT-4o-mini for transcript preprocessing will improve results, but takes longer (around 2 to 3 minutes for a one-hour video)
                
                The following YouTube video IDs are already preprocessed (copy and paste an ID into the box below): 
                
                - `ErnWZxJovaM`: [MIT course](https://www.youtube.com/watch?v=ErnWZxJovaM)
                - `EuC1GWhQdKE`: [Anthropic](https://www.youtube.com/watch?v=EuC1GWhQdKE)
                
                """
                )
        
    gv_transcript = gr.State()
    
    video_id_input = gr.Textbox(label="Enter YouTube Video ID", value = "EuC1GWhQdKE")
    
    with gr.Accordion("Set parameters", open=False):
            
        with gr.Row():
            with gr.Column(scale=1):
                model_format_transcript = gr.Dropdown([("Llama 3 8B (Groq)","llama3-8b"), ("GPT-4o-mini (OpenAI)", "gpt-4o-mini")], label="Transcript preprocessing", value="llama3-8b", interactive=True)
                chunk_size_format_transcript = gr.Textbox(label="Preprocessing chunk size", value = 2000)
            with gr.Column(scale=1):
                model_toc = gr.Dropdown([("Llama 3 8B (Groq)","llama3-8b"), ("GPT-4o-mini (OpenAI)", "gpt-4o-mini")], label="Chaptering", value="gpt-4o-mini", interactive=True)
                chunk_size_toc = gr.Textbox(label="Chaptering chunk size", value = 30)
            with gr.Column(scale=1):
                api_key_openai = gr.Textbox(label="OpenAI API Key", value = "xxx")
                api_key_groq = gr.Textbox(label="Groq API Key", value = "xxx")
    
    processing_button = gr.Button("Process transcript")
    
    gv_output = gr.State()
    
    gr.HTML("<hr>")
    
    output_processing = gr.HTML(label = "Output processing", value=example_output_html)
    
    processing_button.click(gradio_process_video,
                            inputs=[video_id_input, 
                                    model_format_transcript, model_toc,
                                    chunk_size_format_transcript, chunk_size_toc],
                            outputs=[output_processing, gv_output])
    
    #gr.HTML(result_as_html)
    

app.launch(debug=True, width= "100%")

Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Number of chunks: 21
i is: 0
input tokens: 483; output tokens: 487, price: 0.00036465
Found paragraphs: 30
i is: 2000
input tokens: 455; output tokens: 424, price: 0.00032265
Found paragraphs: 11
i is: 4000
input tokens: 471; output tokens: 471, price: 0.00035325
Found paragraphs: 14
i is: 6000
input tokens: 513; output tokens: 509, price: 0.00038235
Found paragraphs: 24
i is: 8000
input tokens: 464; output tokens: 397, price: 0.0003078
Found paragraphs: 11
i is: 10000
input tokens: 505; output tokens: 473, price: 0.00035955000000000004
Found paragraphs: 26
i is: 12000
input tokens: 489; output tokens: 425, price: 0.00032834999999999993
Found paragraphs: 21
i is: 14000
input tokens: 508; output tokens: 448, price: 0.000345
Found paragraphs: 10
i is: 16000
input tokens: 505; output tokens: 476, price: 0.00036135
Found paragraphs: 24
i is: 18000
input tokens: 447; output tokens: 422, price: 0.00032025
Found paragraphs: 20
i is: 20000
input tokens: 463; output tokens: 433, price: 0.000329

