# YouTube Video Transcript

In [4]:
! pip install youtube_transcript_api

Collecting youtube_transcript_api
  Downloading youtube_transcript_api-1.1.1-py3-none-any.whl.metadata (23 kB)
Downloading youtube_transcript_api-1.1.1-py3-none-any.whl (485 kB)
Installing collected packages: youtube_transcript_api
Successfully installed youtube_transcript_api-1.1.1


In [5]:
from youtube_transcript_api import YouTubeTranscriptApi



In [6]:
# First way: split the link manually.
# Raises IndexError if the link does not contain 'watch?v='.
def get_video_id(url_link: str) -> str:
    return url_link.split('watch?v=')[1].split('&')[0]

In [7]:
# Second way: parse the link with the standard library
from urllib.parse import urlparse, parse_qs  # standard-library helpers for working with URLs

def get_video_id_2(url_link: str) -> str:
    # urlparse(...) splits the URL into parts; .query is the part after the '?',
    # e.g. v=xDQL3vWwcp0&list=someListId
    query = urlparse(url_link).query

    # parse_qs(...) turns the query string into a dictionary:
    # {'v': ['xDQL3vWwcp0'], 'list': ['someListId']}
    params = parse_qs(query)

    # If 'v' is missing, fall back to [''] so the function returns an empty string.
    return params.get('v', [''])[0]

In [8]:
video_id = get_video_id_2('https://www.youtube.com/watch?v=xDQL3vWwcp0&list=PL49M3zg4eCviRD4-hTjS5aUZs3PzAFYkJ&index=2')
print(video_id)

xDQL3vWwcp0
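
Both helpers above assume a `watch?v=` style link. As a minimal sketch, assuming we also want to accept short (`youtu.be`), embed, and Shorts links, the same `urlparse`/`parse_qs` approach can be extended as follows. The function name `extract_video_id` and the set of URL shapes it handles are illustrative choices, not part of the original notebook.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Return the video ID from a few common YouTube URL shapes, or '' if none is found."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0]

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            return parse_qs(parsed.query).get("v", [""])[0]
        # Embed and Shorts links: /embed/<id> and /shorts/<id>
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) >= 2 and parts[0] in ("embed", "shorts"):
            return parts[1]

    # Unknown host or path shape
    return ""
```

Returning an empty string for unrecognized links mirrors the default used in `get_video_id_2`, so callers can check for `''` instead of catching an exception.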


In [9]:
transcript = YouTubeTranscriptApi.get_transcript(video_id)

In [10]:
print(transcript)  # a list of dicts, one per caption segment, with 'text', 'start', and 'duration' keys

[{'text': "hello everyone I am Santi and I'm", 'start': 0.16, 'duration': 3.559}, {'text': 'currently working as an ml engineer', 'start': 1.839, 'duration': 3.48}, {'text': 'today we are going to do an awesome', 'start': 3.719, 'duration': 3.241}, {'text': 'machine learning project that you can', 'start': 5.319, 'duration': 4.44}, {'text': 'add to your resume and impress all the', 'start': 6.96, 'duration': 4.92}, {'text': "interviewers um I'm dedicated to", 'start': 9.759, 'duration': 4.201}, {'text': 'teaching machine learning to all of you', 'start': 11.88, 'duration': 4.36}, {'text': 'and to ensure that you all learned an ml', 'start': 13.96, 'duration': 5.319}, {'text': 'job as soon as possible okay so without', 'start': 16.24, 'duration': 5.32}, {'text': "any further Ado let's Dive Right", 'start': 19.279, 'duration': 6.641}, {'text': 'In so our title is going to be um so our', 'start': 21.56, 'duration': 6.479}, {'text': 'project title is going to be uh YouTube', 'start': 25.92

[line['text'] for line in transcript]
means:

Go through each entry in transcript and collect only its line['text'] value.

Example result (illustrative values, not from this video):

['Hello and welcome', 'Today we learn Python', 'Let’s get started']
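
As a small runnable sketch of the same idea, the snippet below applies the list comprehension to two entries copied from the transcript output above and then joins them with spaces. The variable names `sample_transcript`, `texts`, and `joined` are only for illustration.

```python
# Two entries shaped like the items returned for this video
sample_transcript = [
    {"text": "hello everyone I am Santi and I'm", "start": 0.16, "duration": 3.559},
    {"text": "currently working as an ml engineer", "start": 1.839, "duration": 3.48},
]

# Keep only the 'text' value of every entry
texts = [line["text"] for line in sample_transcript]

# Join the pieces into one string separated by single spaces
joined = " ".join(texts)
print(joined)
```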

In [11]:
# Join the transcript segments into a single paragraph
transcript_joined = " ".join([line['text'] for line in transcript])

In [12]:
transcript_joined

"hello everyone I am Santi and I'm currently working as an ml engineer today we are going to do an awesome machine learning project that you can add to your resume and impress all the interviewers um I'm dedicated to teaching machine learning to all of you and to ensure that you all learned an ml job as soon as possible okay so without any further Ado let's Dive Right In so our title is going to be um so our project title is going to be uh YouTube video summarizer with llm well so what's going to do is going to take in a video and you can ask any question regarding the videos contents the images Etc and that's going to be really really good really useful as well and that can be a mini deployed project as well okay so first things first um let's let's let's think about why do we need this in the first place the interviewer might say why not just use chat GPT right because chat gpg has now enabled the search web feature for every free user right so you might be wondering why not just use

The transcript has no punctuation, so it is just a stream of words. To turn it into a meaningful paragraph we need to restore the punctuation. For this we use the rpunct library.

In [13]:
!pip install git+https://github.com/babthamotharan/rpunct.git@patch-2

Collecting git+https://github.com/babthamotharan/rpunct.git@patch-2
  Cloning https://github.com/babthamotharan/rpunct.git (to revision patch-2) to /tmp/pip-req-build-6mmwv7nu
  Running command git clone --filter=blob:none --quiet https://github.com/babthamotharan/rpunct.git /tmp/pip-req-build-6mmwv7nu
  Running command git checkout -b patch-2 --track origin/patch-2
  Switched to a new branch 'patch-2'
  Branch 'patch-2' set up to track remote branch 'patch-2' from 'origin'.
  Resolved https://github.com/babthamotharan/rpunct.git to commit a87b93410ca782657abb4e34df9159e6e47ac9ec
  Preparing metadata (setup.py) ... done
Collecting langdetect>=1.0.9 (from rpunct==1.0.2)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Collecting simpletransformers>=0.61.4 (from rpunct==1.0.2)
  Downloading simpletran

What does it do behind the scenes:
It uses a pretrained Hugging Face model that takes a block of unpunctuated text and restores the punctuation.

In [14]:
from rpunct import RestorePuncts
rpunct = RestorePuncts()

ImportError: cannot import name 'DummyObject' from 'transformers' (/usr/local/lib/python3.11/dist-packages/transformers/__init__.py)

Why This Happens
rpunct internally uses simpletransformers

simpletransformers expects transformers to expose DummyObject, which the installed transformers version no longer provides (hence the ImportError above)

So you need to stick to an older transformers version that still supports it
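
One way to diagnose this kind of failure before importing the dependent library is to check whether the module actually exposes the expected name. The helper below is a minimal sketch using only the standard library; `module_exports` is a hypothetical name introduced here, not part of rpunct, simpletransformers, or transformers.

```python
import importlib


def module_exports(module_name: str, attr: str) -> bool:
    """Return True if module_name can be imported and exposes attr at top level."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself is missing or fails to import
        return False
    return hasattr(module, attr)


# Works for any module; shown here with the standard library
print(module_exports("math", "sqrt"))
print(module_exports("math", "not_a_real_name"))
```

In this notebook the relevant check would be `module_exports("transformers", "DummyObject")`; given the ImportError above, we would expect it to return `False` in this environment.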


### So we use a different model

In [15]:
!pip install deepmultilingualpunctuation


Collecting deepmultilingualpunctuation
  Downloading deepmultilingualpunctuation-1.0.1-py3-none-any.whl.metadata (4.0 kB)
Downloading deepmultilingualpunctuation-1.0.1-py3-none-any.whl (5.4 kB)
Installing collected packages: deepmultilingualpunctuation
Successfully installed deepmultilingualpunctuation-1.0.1


In [16]:
from deepmultilingualpunctuation import PunctuationModel

# Load model
model = PunctuationModel()

config.json:   0%|          | 0.00/892 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/406 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

Device set to use cpu


In [17]:
from google.colab import userdata
hf_token = userdata.get('HF_TOKEN')  # assign to a variable so the secret is not echoed in the cell output

In [18]:
punctuated_text = model.restore_punctuation(transcript_joined)

In [19]:
print(punctuated_text)

hello everyone. I am Santi and I'm currently working as an ml engineer. today, we are going to do an awesome machine learning project that you can add to your resume and impress all the interviewers. um, I'm dedicated to teaching machine learning to all of you and to ensure that you all learned an ml job as soon as possible. okay, so, without any further Ado, let's Dive Right In. so our title is going to be um, so our project title is going to be, uh, YouTube video summarizer with llm. well, so what's going to do is going to take in a video and you can ask any question regarding the videos contents, the images, Etc. and that's going to be really really good, really useful as well, and that can be a mini deployed project as well. okay, so, first things, first. um, let's, let's, let's think about why do we need this in the first place? the interviewer might say: why not just use chat GPT? right, because chat gpg has now enabled the search web feature for every free user, right? so you mi

In [20]:
!pip install groq --upgrade

Collecting groq
  Downloading groq-0.29.0-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.29.0-py3-none-any.whl (130 kB)
Installing collected packages: groq
Successfully installed groq-0.29.0


In [21]:
from groq import Groq

# Read the API key from Colab secrets instead of hard-coding it
api_key_groq = userdata.get('GROQ_YOUTUBE_COLLAB')

client = Groq(api_key=api_key_groq)

# Make a test request (system message only, no user message yet)
response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
    ]
)

# Print response
print(response.choices[0].message.content)

I'd be happy to help you. How can I assist you today? Do you have a specific question, task, or topic you'd like to discuss? I'm here to listen and provide information to help you achieve your goals.


In [35]:
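# Note: the prompt and lengths printed in In [23] and In [27] below appear to come from an
# earlier run of this cell that used "Summarize this text:"; this version ran later as In [35].
prompt = f'Answer based on this paragraph:\ntext = "{punctuated_text}"\nWhat is generally taught here?'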

In [23]:
print(prompt)

"Summarize this text: 
 text = "hello everyone. I am Santi and I'm currently working as an ml engineer. today, we are going to do an awesome machine learning project that you can add to your resume and impress all the interviewers. um, I'm dedicated to teaching machine learning to all of you and to ensure that you all learned an ml job as soon as possible. okay, so, without any further Ado, let's Dive Right In. so our title is going to be um, so our project title is going to be, uh, YouTube video summarizer with llm. well, so what's going to do is going to take in a video and you can ask any question regarding the videos contents, the images, Etc. and that's going to be really really good, really useful as well, and that can be a mini deployed project as well. okay, so, first things, first. um, let's, let's, let's think about why do we need this in the first place? the interviewer might say: why not just use chat GPT? right, because chat gpg has now enabled the search web feature for e

In [27]:
print(len(punctuated_text))
print(len(prompt))

11007
11039
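
These lengths are character counts, not tokens. This transcript was accepted by the model as a single prompt, but a longer video might not fit in the model's context window. As a hedged sketch, one option is to split the punctuated text at sentence boundaries into chunks below a character budget and query each chunk separately. The function name `split_into_chunks` and the default `max_chars=4000` are assumptions chosen for illustration, not values taken from the Groq documentation.

```python
def split_into_chunks(text: str, max_chars: int = 4000) -> list[str]:
    """Greedily pack sentences into chunks of at most max_chars characters.

    Sentences are detected by splitting on '. '. A single sentence longer than
    max_chars is kept whole in its own chunk rather than being cut mid-sentence.
    """
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Rejoin with '. ' to restore the separator removed by split
        candidate = f"{current}. {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

For example, `split_into_chunks(punctuated_text)` would return a list of strings that can each be placed in its own prompt.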


In [36]:
response = client.chat.completions.create(
    model="llama3-8b-8192",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
)

print(response.choices[0].message.content)

Based on the paragraph, the topic that is generally taught here is binary classification, specifically the derivation and math related to binary classification, which includes concepts such as likelihood function, sigmoid function, gradient descent, optimal parameters, and how to determine the equation in Theta.


In [37]:
print(len(response.choices[0].message.content))

313
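
To ask several different questions about the same transcript, it can help to build the message list in one place. The sketch below is one possible way to do that; `build_messages` is a hypothetical helper, and the prompt wording is an assumption rather than the exact prompt used above.

```python
def build_messages(text: str, question: str) -> list[dict]:
    """Build a chat message list that keeps the transcript and the question clearly separated."""
    user_content = (
        "Answer based only on the transcript below.\n\n"
        f'Transcript:\n"""\n{text}\n"""\n\n'
        f"Question: {question}"
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_content},
    ]


# Small offline check of the structure (no API call is made here)
messages = build_messages("hello everyone.", "What is generally taught here?")
print(messages[1]["content"])
```

The result can be passed to the same call used earlier in this notebook, for example `client.chat.completions.create(model="llama3-8b-8192", messages=build_messages(punctuated_text, "What is generally taught here?"))`.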
