# Divide a video/audio into chapters

- https://spacy.io/api/sentencizer

- https://www.kaggle.com/code/davidfmora/automatic-chapterization-and-question-answering
- https://www.kaggle.com/code/davidfmora/fabiobot
- https://towardsdatascience.com/text-tiling-done-right-building-solid-foundations-for-your-personal-llm-e70947779ac1
- https://www.sievedata.com/blog/ai-auto-video-chapters


add drive to colab chapter seg




### Connect Google Drive


In [1]:

from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
ls /content/drive/MyDrive/AI_Material/Advanced/VideoChapterGenerator/


[0m[01;34mErnWZxJovaM[0m/  video-chapter-generator.ipynb


# Install and import libraries

- `pytube`: used to download a Youtube video
- `youtube-transcript-api`: used to directly download the video transcript from Youtube, if available
- `faster_whisper`: used to get transcript from audio
- `anthropic`: used to access Claude 3.0 large multimodal model

In [1]:
%%capture
!pip install -q yt_dlp
!pip install -q youtube-transcript-api
!pip install -q faster_whisper
!pip install -q openai
!pip install -q gradio

Let us load the libraries


In [2]:
import os
from pathlib import Path
import json
import re

import yt_dlp
from youtube_transcript_api import YouTubeTranscriptApi
from faster_whisper import WhisperModel
import torch

#import cv2 #Used to extract frames from video
from openai import OpenAI

import markdown

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#from dotenv import load_dotenv
#load_dotenv()

Define Youtube video ID, folder to store video, chapters, and resulting blog post.


In [3]:
# Usage example
video_id = 'dJ3sLNRNNAM'
video_id = 'ErnWZxJovaM' # MIT course
#video_id = 'hUPrUJ4F_Ro' # TED talk
#video_id = 'zjkBMFhNj_g' # Intro LLM Karpathy

DATA_DIR= f"/content/drive/MyDrive/AI_Material/Advanced/VideoChapterGenerator/{video_id}"

DATA_DIR = "tmp/{video_id}"

if not os.path.exists(DATA_DIR):
    os.makedirs(DATA_DIR)


## Get transcript



### With YouTubeTranscriptApi

In [4]:
transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
transcript[0:4]

[{'text': '[Music]', 'start': 1.17, 'duration': 3.21},
 {'text': 'good afternoon everyone and welcome to',
  'start': 10.28,
  'duration': 6.56},
 {'text': 'MIT sus1 191 my name is Alexander amini',
  'start': 12.88,
  'duration': 5.44},
 {'text': "and I'll be one of your instructors for",
  'start': 16.84,
  'duration': 4.72}]

In [7]:
len(transcript)

1789

### With Whisper



#### Download audio

In [None]:
def download_audio(video_id, DOWNLOAD_DIR="temp_download"):

    os.makedirs(f"{DOWNLOAD_DIR}", exist_ok=True)
    os.makedirs(f"{DOWNLOAD_DIR}/{video_id}", exist_ok=True)

    audio_path = f"{DOWNLOAD_DIR}/{video_id}/{video_id}_audio.mp4"

    # Define options for yt-dlp
    ydl_opts = {
        'format': f'bestaudio',  # Select the best quality format
        'outtmpl': audio_path
    }

    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        video_url = 'https://www.youtube.com/watch?v=' + video_id
        ydl.download([video_url])


    return audio_path

# About 5 seconds for a one hour video (65MB of audio)
%time audio_path=download_audio(video_id, DATA_DIR)

#### Transcribe with Whisper

In [None]:
whisper_model = WhisperModel("large-v3",
                              device="cuda" if torch.cuda.is_available() else "cpu",
                              compute_type="float16",
                            )

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


model.bin:   0%|          | 0.00/3.09G [00:00<?, ?B/s]

config.json:   0%|          | 0.00/2.39k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.48M [00:00<?, ?B/s]

preprocessor_config.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

vocabulary.json:   0%|          | 0.00/1.07M [00:00<?, ?B/s]

In [None]:
def speech_to_text(whisper_model, audio_file, initial_prompt="Use punctuation, like this.", language="en", segments=None):

        segments, transcript_info = whisper_model.transcribe(audio_file,  initial_prompt=initial_prompt, language=language)
        segments = list(segments)
        segments = [
            {
                "start": round(s.start,2),
                "duration": round(s.end-s.start,2),
                "text": s.text,
            }
            for s in segments
        ]

        return segments

In [None]:
#14 minutes for a 1h10 video on T4
%time transcript = speech_to_text(whisper_model, audio_path)

CPU times: user 14min 11s, sys: 14.7 s, total: 14min 25s
Wall time: 13min 48s


In [None]:
transcript[0:3]

[{'start': 0.0,
  'duration': 15.28,
  'text': ' Good afternoon, everyone, and welcome to MIT Success 191.'},
 {'start': 15.28,
  'duration': 4.04,
  'text': " My name is Alexander Amini, and I'll be one of your instructors for the course this year"},
 {'start': 19.32, 'duration': 2.08, 'text': ' along with Ava.'}]

In [7]:
with open(f"{DATA_DIR}/{video_id}_transcript.json", "w") as f:
        json.dump(transcript, f, indent=4)

## Use LLM to improve readability

We want to add punctuation to the transcript, and also improve the transcript's readability (removing verbal tics, correcting grammatical errors, and adding appropriate line breaks)
Two constraints are important to consider:
 1. Models do not handle well long inputs. The longer the transcript, the higher the risk that the model will skip part of the input.
 2. The timestamps will be removed

Solutions:
- Split the transcript in chunks of 5000 characters
- Timestamps will be readded at a later stage


#### Concatenate text



In [5]:
def get_transcript_as_text(transcript):
    temp_list = [s['text'] for s in transcript]
    transcript_as_text = ' '.join(temp_list)

    return transcript_as_text

transcript_as_text = get_transcript_as_text(transcript)
print(len(transcript_as_text))
transcript_as_text[0:1000]

66632


"[Music] good afternoon everyone and welcome to MIT sus1 191 my name is Alexander amini and I'll be one of your instructors for the course this year along with Ava and together we're really excited to welcome you to this really incredible course this is a very fast-paced and very uh intense one week that we're about to go through together right so we're going to cover the foundations of a also very fast-paced moving field and a field that has been rapidly changing over the past eight years that we have taught this course at MIT now over the past decade in fact even before we started teaching this course Ai and deep learning has really been revolutionizing so many different advances and so many different areas of science meth mathematics physics and and so on and not that long ago we were having new types of we were having challenges and problems that we did not think were necessarily solvable in our lifetimes that AI is now actually solving uh Beyond human performance today and each ye

Put your LLM API key here:


In [32]:
api_key ="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"
openai_client = OpenAI(api_key=api_key)
openai_client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
openai_model = 'gpt-4o-mini'

In [None]:
model="lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF",

In [22]:
openai_client

<openai.OpenAI at 0x286add4d0>

In [7]:
price_token={'gpt-4-turbo-2024-04-09': {'input': 10/1000000, 'output': 30/1000000},
             'gpt-3.5-turbo-0125': {'input': 0.5/1000000, 'output': 1.5/1000000},
             'gpt-4o': {'input': 5/1000000, 'output': 15/1000000},
             'gpt-4o-mini': {'input': 0.15/1000000, 'output': 0.6/1000000},
             'claude-3-opus-20240229': {'input': 15/1000000, 'output': 75/1000000},
             'claude-3-5-sonnet-20240620': {'input': 3/1000000, 'output': 15/1000000},
             'claude-3-haiku-20240307': {'input': 0.25/1000000, 'output': 1.25/1000000},
             'gemini-1.5-flash': {'input': 0.35 / 1000000, 'output': 1.05 / 1000000},
             'gemini-1.5-pro': {'input': 3.5 / 1000000, 'output': 10.5 / 1000000},
             }

In [26]:
def call_open_ai_model(client, model, system_prompt, prompt,
                       temperature=0, seed=42, response_format=None):

    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        model="lmstudio-community/Meta-Llama-3.1-8B-Instruct-GGUF",
        temperature=temperature,
        seed=seed,
        response_format=response_format
    )

    nb_input_tokens = response.usage.prompt_tokens
    nb_output_tokens = response.usage.completion_tokens
    price = nb_input_tokens * price_token[model]['input'] + nb_output_tokens * price_token[model]['output']

    print(f"input tokens: {nb_input_tokens}; output tokens: {nb_output_tokens}, price: {price}")

    response_content=response.choices[0].message.content

    return response_content, nb_input_tokens, nb_output_tokens, price


In [9]:
system_prompt_transcript_to_paragraphs = f"""

You are an expert at improving readability of raw audio transcripts.

You are given a raw audio transcript. Your task is to improve the transcript's readability: add punctuation if needed, remove verbal tics, correct grammatical errors, and add appropriate line breaks (with '\n\n').

Ensure that the your output maintains the original meaning and flow of the transcript while being more polished and easier to read.

Do not include any other information in your response.

"""


In [10]:
response_content, nb_input_tokens, nb_output_tokens, price = \
            call_open_ai_model(openai_client, openai_model, system_prompt_transcript_to_paragraphs, transcript_as_text[0:5000],
                               temperature=0, seed=42, response_format=None)

input tokens: 1147; output tokens: 618, price: 0.0005428499999999999


In [11]:
print(response_content)

 [Music] Good afternoon, everyone, and welcome to MIT 16.119! My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We are really excited to welcome you to this incredible course. This is a fast-paced, intense one-week course that we're about to go through together.

We will cover the foundations of AI and deep learning, a field that has been rapidly changing over the past eight years since we started teaching this course at MIT. In fact, even before we began teaching it, AI and deep learning have revolutionized so many different areas of science, mathematics, physics, and more.

Not long ago, we were facing challenges and problems that we didn't think were necessarily solvable in our lifetimes, but AI is now solving them beyond human performance. Each year that we teach this course, especially this lecture, becomes harder to teach because, for an introductory level course, it should cover the foundations. However, in a rapidly changi

### Split transcript in chunks for long videos

In [13]:
def transcript_to_paragraphs(transcript, openai_client, openai_model, chunk_size=5000):

    transcript_as_text = get_transcript_as_text(transcript)

    paragraphs = []
    last_paragraph = ""

    total_nb_input_tokens, total_nb_output_tokens, total_price = 0, 0, 0

    for i in range(0, 10000, chunk_size): #(transcript), chunk_size):
    #for i in range(0, len(transcript), chunk_size):
        chunk = last_paragraph + " " + transcript_as_text[i:i + chunk_size]

        response_content, nb_input_tokens, nb_output_tokens, price = \
            call_open_ai_model(openai_client, openai_model, system_prompt = system_prompt_transcript_to_paragraphs, prompt = chunk,
                               temperature=0, seed=42, response_format=None)

        total_nb_input_tokens += nb_input_tokens
        total_nb_output_tokens += nb_output_tokens
        total_price += price

        paragraphs_chunk = response_content.split('\n\n')

        print('Found paragraphs:', len(paragraphs_chunk))
        last_paragraph = paragraphs_chunk[-1]

        paragraphs += paragraphs_chunk[:-1]

    paragraphs += [last_paragraph]

    paragraphs_dict = [{'paragraph_number': i, 'paragraph_text': paragraph} for i, paragraph in enumerate(paragraphs)]

    return paragraphs_dict, total_nb_input_tokens, total_nb_output_tokens, total_price


In [14]:
paragraphs, nb_input_tokens, nb_output_tokens, price = transcript_to_paragraphs(transcript, openai_client, openai_model, chunk_size=5000)

input tokens: 1033; output tokens: 845, price: 0.0006619499999999999
Found paragraphs: 14
input tokens: 1050; output tokens: 958, price: 0.0007323
Found paragraphs: 17


In [15]:
len(paragraphs)

30

In [16]:
paragraphs[0:2]

[{'paragraph_number': 0, 'paragraph_text': '[Music] '},
 {'paragraph_number': 1,
  'paragraph_text': "Good afternoon, everyone, and welcome to MIT's 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course. "}]

In [17]:
with open(f"{DATA_DIR}/{video_id}_paragraphs.json", "w") as f:
        json.dump(paragraphs, f, indent=4)

## Get table of content


In [16]:
paragraphs_json_dump = json.dumps(paragraphs)

In [17]:
paragraphs_json_dump[0:1000]

'[{"paragraph_number": 0, "paragraph_text": "[Music] "}, {"paragraph_number": 1, "paragraph_text": "Good afternoon, everyone, and welcome to MIT\'s 6.S191. My name is Alexander Amini, and I\'ll be one of your instructors for the course this year, along with Ava. Together, we\'re really excited to welcome you to this incredible course. "}, {"paragraph_number": 2, "paragraph_text": "This is a very fast-paced and intense one week that we\'re about to go through together. We will cover the foundations of a rapidly evolving field that has changed significantly over the past eight years since we began teaching this course at MIT. "}, {"paragraph_number": 3, "paragraph_text": "In fact, over the past decade, even before we started teaching this course, AI and deep learning have been revolutionizing many different areas of science, mathematics, physics, and more. Not long ago, we faced challenges and problems that we did not think were solvable in our lifetimes, but AI is now solving them, ofte

In [18]:
system_prompt_paragraphs_to_sections = f"""

	You are an expert at creating table of contents for online courses.

	You will be given a transcript of a course in JSON format as a list of paragraphs, each containing 'paragraph_number' and 'paragraph_text' keys.

	Your task is to group consecutive paragraphs in sections for the course and identify meaningful section titles.

	Here are the steps to follow:

1. Read the transcript carefully to understand its general structure and the main topics covered.
2. Look for clues that a new section is about to start. This could be a change of topic, a change of time or setting, the introduction of new themes or topics, or the speaker's explicit mention of a new section.
3. For each section, keep track of the paragraph number that starts the section a,d identify a meaningful section title.
4. Sections should ideally be equally spaced throughout the transcript, and discuss a specific topic.
5. A section MUST have more than 4 paragraphs.
6. A section should group between 10 and 30 paragraphs.

	Format your result in JSON, with a list dictionaries for sections, with 'start_paragraph_number':integer and 'title':string as key:value.

"""


In [34]:
def paragraphs_to_toc(paragraphs, openai_client, openai_model, chunk_size=100):

    sections = []
    number_last_section = 0

    total_nb_input_tokens, total_nb_output_tokens, total_price = 0, 0, 0

    while number_last_section < len(paragraphs):

        print(number_last_section)

        chunk = paragraphs[number_last_section:(number_last_section + chunk_size)]

        chunk_json_dump = json.dumps(chunk)

        toc_content, nb_input_tokens, nb_output_tokens, price = \
            call_open_ai_model(openai_client, openai_model,
                               system_prompt = system_prompt_paragraphs_to_sections,  prompt = chunk_json_dump,
                               temperature=0.7, seed=0, response_format={"type": "json_object"})

        total_nb_input_tokens += nb_input_tokens
        total_nb_output_tokens += nb_output_tokens
        total_price += price

        toc_content = json.loads(toc_content)['sections']
        print(toc_content)

        if number_last_section == toc_content[-1]['start_paragraph_number']:
            break

        number_last_section = toc_content[-1]['start_paragraph_number']

        sections += toc_content[:-1]

    sections += [toc_content[-1]]

    return sections, total_nb_input_tokens, total_nb_output_tokens, total_price

In [36]:
len(paragraphs)

30

In [35]:
table_of_content, nb_input_tokens, nb_output_tokens, price = paragraphs_to_toc(paragraphs, openai_client, openai_model, chunk_size=100)

0
input tokens: 2446; output tokens: 215, price: 0.0004959


JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [25]:
table_of_content

[{'start_paragraph_number': 1,
  'title': 'Introduction to the Course and Overview of AI'},
 {'start_paragraph_number': 18,
  'title': 'Understanding Intelligence and Deep Learning'},
 {'start_paragraph_number': 25, 'title': 'Course Overview and Structure'}]

In [26]:
with open(f"{DATA_DIR}/{video_id}_toc.json", "w") as f:
        json.dump(table_of_content, f, indent=4)

### Add timestamps to paragraphs



In [27]:
def transform_text_segments(text_segments, num_words=50):
    # Initialize variables
    transformed_segments = []
    current_index = 0
    num_segments = len(text_segments)

    for i in range(num_segments):

        current_index = i

        # Get the current segment's starting timestamp and text
        current_segment = text_segments[current_index]
        current_start = current_segment['start']
        current_text = current_segment['text']

        # Initialize a list to hold the combined text
        combined_text = " ".join(current_text.split()[:num_words])
        words_collected = len(current_text.split())

        # Collect words from subsequent segments
        while words_collected < num_words and (current_index + 1) < num_segments:
            current_index += 1
            next_segment = text_segments[current_index]
            next_text = next_segment['text']
            next_words = next_text.split()

            # Append words from the next segment
            if words_collected + len(next_words) <= num_words:
                combined_text += ' ' + next_text
                words_collected += len(next_words)
            else:
                # Only append enough words to reach the num_words limit
                words_needed = num_words - words_collected
                combined_text += ' ' + ' '.join(next_words[:words_needed])
                words_collected = num_words

        # Append the combined segment to the result
        transformed_segments.append(combined_text)

    return transformed_segments

In [28]:
def find_best_match_index(transcript, paragraph, num_words=50):

    text_list = transform_text_segments(transcript, num_words=num_words)

    # Create a TF-IDF vectorizer
    vectorizer = TfidfVectorizer().fit_transform(text_list + [paragraph])

    # Get the TF-IDF vectors for the transcript and the excerpt
    vectors = vectorizer.toarray()

    # Extract the TF-IDF vector for the excerpt
    excerpt_vector = vectors[-1]

    # Calculate the cosine similarity between the excerpt vector and each transcript chunk
    similarities = cosine_similarity(vectors[:-1], excerpt_vector.reshape(1, -1))
    # print(similarities)

    # Find the index of the most similar chunk
    best_match_index = int(np.argmax(similarities))

    return best_match_index


def add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50):
    list_indices = []

    paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']} for p in paragraphs]
    paragraphs_50 = transform_text_segments(paragraphs_start_text, num_words=num_words)

    for i in range(len(paragraphs_50)):

        best_match_index = find_best_match_index(transcript, paragraphs_50[i], num_words=num_words)

        list_indices.append(best_match_index)

        paragraphs[i]['matched_index'] = best_match_index
        paragraphs[i]['matched_text'] = transcript[best_match_index]['text']
        paragraphs[i]['start_time'] = int(transcript[best_match_index]['start'])


    return paragraphs

In [29]:
%%time
paragraphs = add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50)

CPU times: user 5.28 s, sys: 3.73 s, total: 9 s
Wall time: 5.69 s


In [30]:
paragraphs[0:5]

[{'paragraph_number': 0,
  'paragraph_text': '[Music] ',
  'matched_index': 0,
  'matched_text': '[Music]',
  'start_time': 1},
 {'paragraph_number': 1,
  'paragraph_text': "Good afternoon, everyone, and welcome to MIT's 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course. ",
  'matched_index': 1,
  'matched_text': 'good afternoon everyone and welcome to',
  'start_time': 10},
 {'paragraph_number': 2,
  'paragraph_text': "This is a very fast-paced and intense one-week experience that we're about to go through together. We will cover the foundations of a rapidly evolving field that has changed significantly over the past eight years since we began teaching this course at MIT. ",
  'matched_index': 8,
  'matched_text': "intense one week that we're about to go",
  'start_time': 29},
 {'paragraph_number': 3,
  'paragraph_text': 'In fact, over the past decade

In [31]:
with open(f"{DATA_DIR}/{video_id}_paragraphs.json", "w") as f:
        json.dump(paragraphs, f, indent=4)

## Get chapters

In [160]:
transcript[-1]

{'text': 'presentation okay thank you', 'start': 4191.199, 'duration': 4.48}

In [41]:
def get_chapters(paragraphs, table_of_content):

    chapters = []

    for i in range(len(table_of_content)):


        if i < len(table_of_content) - 1:

            chapter = {'num_chapter': i,
                       'title': table_of_content[i]['title'],
                       'start_paragraph_number': table_of_content[i]['start_paragraph_number'],
                       'end_paragraph_number': table_of_content[i + 1]['start_paragraph_number'],
                       'start_time': paragraphs[table_of_content[i]['start_paragraph_number']]['start_time'],
                       'end_time': paragraphs[table_of_content[i + 1]['start_paragraph_number']]['start_time'],
                      }

        else:
            chapter = {'num_chapter': i,
                       'title': table_of_content[i]['title'],
                       'start_paragraph_number': table_of_content[i]['start_paragraph_number'],
                       'end_paragraph_number': len(paragraphs),
                       'start_time': paragraphs[table_of_content[i]['start_paragraph_number']]['start_time'],
                       'end_time': paragraphs[-1]['start_time'],
                      }

        paragraphs_chapter = [paragraphs[j]['paragraph_text'] for j in
                                range(chapter['start_paragraph_number'], chapter['end_paragraph_number'])]

        paragraph_timestamps_chapter = [paragraphs[j]['start_time'] for j in
                                          range(chapter['start_paragraph_number'], chapter['end_paragraph_number'])]

        chapter['paragraphs'] = paragraphs_chapter
        chapter['paragraph_timestamps'] = paragraph_timestamps_chapter

        chapters.append(chapter)

    return chapters

In [42]:
chapters = get_chapters(paragraphs, table_of_content)


In [43]:
chapters

[{'num_chapter': 0,
  'title': 'Introduction to the Course and Overview of AI',
  'start_paragraph_number': 1,
  'end_paragraph_number': 18,
  'start_time': 10,
  'end_time': 450,
  'paragraphs': ["Good afternoon, everyone, and welcome to MIT's 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course. ",
   "This is a very fast-paced and intense one-week experience that we're about to go through together. We will cover the foundations of a rapidly evolving field that has changed significantly over the past eight years since we began teaching this course at MIT. ",
   'In fact, over the past decade, even before we started teaching this course, AI and deep learning have been revolutionizing many different areas of science, mathematics, physics, and more. Not long ago, we faced challenges and problems that we did not think were solvable in our lifetimes, but AI 

In [44]:
with open(f"{DATA_DIR}/{video_id}.json", "w") as f:
        json.dump(chapters, f, indent=4)

### Chapters to Markdown

In [47]:
def convert_seconds_to_hms(seconds):
    # Calculate hours, minutes, and remaining seconds
    hours = seconds // 3600
    minutes = (seconds % 3600) // 60
    remaining_seconds = seconds % 60

    # Format the result as HH:MM:SS
    return f"{hours:02}:{minutes:02}:{remaining_seconds:02}"

In [54]:
def chapters_to_markdown(chapters):

    markdown = ""

    for i in range(len(chapters)):

        chapter = chapters[i]

        markdown += f"# {chapter['title']}\n\n"

        for j in range(len(chapter['paragraphs'])):

            paragraph = chapter['paragraphs'][j]
            start_time = chapter['paragraph_timestamps'][j]
            from_to = convert_seconds_to_hms(int(start_time))

            markdown += f"{from_to} - {paragraph}\n\n"

    return markdown


In [55]:
markdown = chapters_to_markdown(chapters)

In [56]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown(string))

In [57]:
printmd(markdown)

# Introduction to the Course and Overview of AI

00:00:10 - Good afternoon, everyone, and welcome to MIT's 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. Together, we're really excited to welcome you to this incredible course. 

00:00:29 - This is a very fast-paced and intense one-week experience that we're about to go through together. We will cover the foundations of a rapidly evolving field that has changed significantly over the past eight years since we began teaching this course at MIT. 

00:00:43 - In fact, over the past decade, even before we started teaching this course, AI and deep learning have been revolutionizing many different areas of science, mathematics, physics, and more. Not long ago, we faced challenges and problems that we did not think were solvable in our lifetimes, but AI is now solving them, often surpassing human performance. 

00:01:22 - Each year that we teach this course, this particular lecture is getting harder and harder to deliver. For an introductory level course, this first lecture is supposed to cover the foundations. If you think about any other introductory course, like a 101 course in mathematics or biology, those first lectures don't change much over time. However, we are in a rapidly changing field of AI and deep learning, where even these introductory lectures are evolving quickly. 

00:01:55 - Let me give you an example of how we introduced this course only a few years ago: 

00:02:01 - "Hi, everybody, and welcome to MIT 6.S191, the official introductory course on deep learning taught here at MIT. Deep learning is revolutionizing many fields, from robotics to medicine and everything in between. You'll learn the fundamentals of this field and how you can build incredible algorithms. In fact, this entire speech and video are not real and were created using deep learning and artificial intelligence. It has been an honor to speak with you today, and I hope you enjoy the course."

00:03:07 - The really surprising thing about that video, when we first created it, was how viral it went a few years ago. Within just a couple of months of teaching this course, that video received over a million views. People were shocked by a few things, but the main one was the realism of AI in generating content that looks and sounds hyper-realistic. 

00:03:29 - When we created this video a few years ago, it cost us about $10,000 in compute to generate just a minute-long video. If you think about it, that's extremely expensive for such a short piece of content. Many of you might not even be impressed by the technology today because you see all the amazing things that AI and deep learning are producing now. 

00:03:58 - Fast forward to today, and the progress in deep learning is astounding. People were making all kinds of exciting remarks about it when it first came out a few years ago, but now this technology is common. AI is doing much more powerful things than that fun little introductory video. 

00:04:21 - So, where are we now, about four years later? AI is now generating content with deep learning being so commoditized. Deep learning is at our fingertips, available online and on our smartphones. In fact, we can use deep learning to generate hyper-realistic pieces of media and content entirely from English language prompts, without even needing to code anymore. 

00:04:48 - Before, we had to train these models and write code to create that one-minute video. Today, we have models that can do that for us end-to-end, directly from English language instructions. We can ask these models to create something the world has never seen before, like a photo of an astronaut riding a horse. These models can imagine and generate such content entirely from scratch. 

00:05:11 - My personal favorite is how we can now ask these deep learning models to create new types of software. For example, we can ask them to write TensorFlow code to train a neural network. We're asking a neural network to write code to train another neural network, and our model can produce functional and usable pieces of code that satisfy this English prompt, while also educating the user on what each part of the code does. 

00:05:59 - You can see examples of this, and what I'm trying to highlight is just how far deep learning has come in just a couple of years since we started teaching this course. 

00:06:09 - About eight years ago, the most amazing thing that you'll see in this course, in my opinion, is that we try to teach you the foundations of all of this: how all of these different types of models are created from the ground up, and how we can make all of these amazing advances possible so that you can also do it on your own. 

00:06:31 - Like I mentioned in the beginning, this introductory course is getting harder and harder to do and to make every year. I don't know where the field is going to be next year, and that's my honest truth. Honestly, I don't know even where it will be in one or two months from now, just because it's moving so incredibly fast. 

00:06:52 - What I do know is that what we will share with you in this course, as part of this one week, is going to be the foundations of all of the technologies that we have seen up until this point. This will allow you to create that future for yourselves and to design brand new types of deep learning models using those fundamentals and foundations. 

00:07:15 - So, let's get started and figure out how we can actually achieve all of these different pieces and learn all of these different components. We should start by really tackling the foundations from the very beginning and asking ourselves: 

# Understanding Intelligence and Deep Learning

00:07:30 - We've heard this term, and I think all of you, obviously, before you've come to this class today, have heard the term "deep learning." But it's important for you to really understand how this concept of deep learning relates to all of the other pieces of science that you've learned about so far. 

00:07:47 - To do that, we have to start from the very beginning and think about what intelligence is at its core—not even artificial intelligence, but just intelligence. The way I like to think about this is that intelligence is the ability to process information, which will inform your future decision-making abilities. 

00:08:09 - Now, that's something that we as humans do every single day. Artificial intelligence is simply the ability for us to give computers that same ability to process information and inform future decisions. 

00:08:25 - Machine learning is simply a subset of artificial intelligence. The way you should think of machine learning is as the programming ability—or, let's say, even simpler than that: machine learning is the science of trying to teach computers how to do that processing of information and decision-making from data. 

00:08:47 - Instead of hardcoding some of these rules into machines and programming them like we used to do in software engineering classes, we're now going to try to do that processing of information and inform future decision-making abilities directly from data. 

00:09:02 - Going one step deeper, deep learning is simply the subset of machine learning that uses neural networks to do that. It uses neural networks to process raw pieces of unprocessed data and allows them to ingest all of those very large data sets to inform future decisions. 

00:09:23 - Now, that's exactly what this class is really all about. If I had to summarize this class in just one line, it's about teaching machines how to process data, process information, and inform decision-making abilities from that data, learning from that data. 

# Course Overview and Structure

00:09:39 - This program is split between two different parts. You should think of this class as being captured with both technical lectures, which for example this is one part of, as well as software labs. We'll have several new updates this year, as I mentioned earlier, covering the rapid changes and advances in AI. 

00:10:02 - In some of the later lectures, you're going to see those updates. The first lecture today is going to cover the foundations of neural networks themselves, starting with the building blocks of every single neural network, which is called the perceptron. 

00:10:17 - Finally, we'll go through the week and conclude with a series of exciting guest lectures from industry-leading sponsors of the course. On the software side, after every lecture, you'll also get software experience and project-building experience to take what we teach in lectures and actually deploy them in real code, producing based on the learnings that you find in this lecture. 

00:10:44 - At the very end of the class, from the software side, you'll have the ability to participate in a really fun day, which is the project pitch competition. It's kind of like a Shark Tank-style competition for all of the different projects from all of you, where you can win some really awesome prizes. 

00:10:59 - So, let's step through that a little bit briefly. This is the syllabus part of the lecture. Each day, we'll have dedicated software labs that will basically mirror all of the technical lectures we go through, helping you reinforce your learnings. These are coupled with prizes for the top-performing software solutions that are coming up.



In [None]:
def get_transcript_as_text(transcript):
    temp_list = [s['text'] for s in transcript]
    transcript_as_text = ' '.join(temp_list)

    return transcript_as_text

transcript_as_text = get_transcript_as_text(transcript)
print(len(transcript_as_text))
transcript_as_text[0:1000]

### Gradio app

In [72]:
import gradio as gr

def gradio_video_id_to_transcript(video_id):

    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
    transcript_formatted = [{'start': entry['start'], 'text': entry['text']} for entry in transcript[0:10]]
    transcript_formatted_str = json.dumps(transcript_formatted, indent=2)+'...'

    return {output_transcript: transcript_formatted_str,
            gv_transcript: transcript}

def gradio_transcript_to_paragraphs(gv_transcript_value):

    paragraphs, nb_input_tokens, nb_output_tokens, price = \
        transcript_to_paragraphs(gv_transcript_value, openai_client, openai_model, chunk_size=5000)

    paragraphs_formatted_str = json.dumps(paragraphs[0:4], indent=2)+'...'

    return {output_paragraphs: paragraphs_formatted_str,
            gv_paragraphs: paragraphs}

def gradio_paragraphs_to_toc(gv_paragraphs_value):

    paragraphs_dict = gv_paragraphs_value

    json_toc, nb_input_tokens, nb_output_tokens, price = \
      paragraphs_to_toc(paragraphs_dict, openai_client, openai_model, chunk_size=100)

    json_toc_formatted_str = json.dumps(json_toc[0:4], indent=2)+'...'

    return {output_toc: json_toc_formatted_str,
            gv_toc: json_toc}


def gradio_get_paragraphs_timestamps(gv_transcript_value, gv_paragraphs_value):

    paragraphs = add_timestamps_to_paragraphs(gv_transcript_value, gv_paragraphs_value, num_words=50)

    paragraphs_formatted_str = json.dumps(paragraphs[0:4], indent=2)+'...'

    return {output_paragraphs_timestamps: paragraphs_formatted_str,
            gv_paragraphs: paragraphs}


def gradio_get_chapters(gv_paragraphs_value, gv_toc_value):

    chapters = get_chapters(gv_paragraphs_value, gv_toc_value)

    chapters_formatted_str = json.dumps(chapters[0:4], indent=2)+'...'

    return {output_chapters: chapters_formatted_str,
            gv_chapters: chapters}


def gradio_get_markdown(gv_chapters_value):

    markdown = chapters_to_markdown(gv_chapters_value)

    return markdown



In [73]:
with gr.Blocks() as app:

    gr.Markdown("## Get transcript")

    gv_transcript = gr.State()
    video_id_input = gr.Textbox(label="Video ID", value = "ErnWZxJovaM")
    get_transcript_button = gr.Button("Get transcript")
    output_transcript = gr.Textbox(label = "Transcript (JSON format - start, text)")

    get_transcript_button.click(gradio_video_id_to_transcript,
                                inputs=[video_id_input],
                                outputs=[output_transcript, gv_transcript])

    gr.Markdown("## Transcript to paragraphs")

    gv_paragraphs = gr.State()
    get_paragraphs_button = gr.Button("Get paragraphs")
    output_paragraphs = gr.Textbox(label = "Paragraphs (JSON format - paragraph_number, paragraph_text)")

    get_paragraphs_button.click(gradio_transcript_to_paragraphs,
                                inputs=[gv_transcript],
                                outputs=[output_paragraphs, gv_paragraphs])

    gr.Markdown("## Get table of content")

    gv_toc = gr.State()
    get_toc_button = gr.Button("Get table of contents")
    output_toc = gr.Textbox(label = "Table of content (JSON format - paragraph_number, title)")

    get_toc_button.click(gradio_paragraphs_to_toc,
                         inputs=[gv_paragraphs],
                         outputs=[output_toc, gv_toc])


    gr.Markdown("## Infer paragraph timestamps with TF-IDF")

    get_timestamps_button = gr.Button("Infer paragraph timestamps")
    output_paragraphs_timestamps = gr.Textbox(label = "Paragraphs (JSON format - paragraph_number, paragraph_text, start)")

    get_timestamps_button.click(gradio_get_paragraphs_timestamps,
                                inputs=[gv_transcript, gv_paragraphs],
                                outputs=[output_paragraphs_timestamps, gv_paragraphs])

    gr.Markdown("## Get chapters")

    gv_chapters = gr.State()
    get_chapters_button = gr.Button("Get chapters")
    output_chapters = gr.Textbox(label = "Chapters (JSON format)")

    get_chapters_button.click(gradio_get_chapters,
                              inputs=[gv_paragraphs, gv_toc],
                              outputs=[output_chapters, gv_chapters])


    gr.Markdown("## Markdown formatting")

    get_markdown_button = gr.Button("Markdown formatting")
    output_markdown = gr.Markdown(label = "Chapters (Markdown format)")

    get_markdown_button.click(gradio_get_markdown,
                              inputs=[gv_chapters],
                              outputs=[output_markdown])


app.launch(debug=True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://d9191e30a655c2c1f9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


input tokens: 1033; output tokens: 846, price: 0.00066255
Found paragraphs: 14
input tokens: 1050; output tokens: 944, price: 0.0007239
Found paragraphs: 17
0
input tokens: 2404; output tokens: 87, price: 0.00041279999999999995
[{'start_paragraph_number': 1, 'title': 'Introduction to the Course and Overview of AI'}, {'start_paragraph_number': 18, 'title': 'Understanding Intelligence and Deep Learning'}, {'start_paragraph_number': 25, 'title': 'Course Structure and Learning Outcomes'}]
25
input tokens: 615; output tokens: 33, price: 0.00011204999999999999
[{'start_paragraph_number': 25, 'title': 'Course Overview and Structure'}]
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://d9191e30a655c2c1f9.gradio.live


