# Get YouTube video transcripts

Create a dataset of transcripts for the YouTube videos in the [Awesome Nature](https://www.youtube.com/playlist?list=PLD018AC9B25A23E16) playlist from the [TED-Ed](https://www.youtube.com/@TEDEd) channel.

## Step 1. Get video URLs


In [1]:
import requests
import re
import pandas as pd

playlist_url = "https://www.youtube.com/playlist?list=PLD018AC9B25A23E16"
headers = {
    'Accept-Language': 'en-US'
}
response = requests.get(playlist_url, headers=headers)
response.raise_for_status()  # stop early if the playlist page could not be fetched
html_content = response.text

title_pattern = r'"title":\{"runs":\[\{"text":"(.*?)"\}\]'
video_id_pattern = r'\{"webCommandMetadata":\{"url":"\/watch\?v=(.*?)\\u0026list'

# Each pattern has one capture group, so findall returns the captured strings.
titles = re.findall(title_pattern, html_content)
video_ids = re.findall(video_id_pattern, html_content)

titles_series = pd.Series(titles, dtype="string")
video_ids_series = pd.Series(video_ids, dtype="string")

df = pd.DataFrame.from_dict({"video_id": video_ids_series, "title": titles_series})
# Limit to 100 entries
df = df[:100]
print(df.head())
df.info()


      video_id                                              title
0  W9wAfqBd_T0  How turtle shells evolved... twice - Judy Cebr...
1  Cd-artSbpXc          Why are fish fish-shaped? - Lauren Sallan
2  _hBAr7uJ6L8  The surprising reasons animals play dead - Tie...
3  uSTNyHkde08  Why isn't the world covered in poop? - Eleanor...
4  -64U7WoBrqM             Why are sloths so slow? - Kenny Coogan
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   video_id  100 non-null    string
 1   title     100 non-null    string
dtypes: string(2)
memory usage: 1.7 KB
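
The two regular expressions in the cell above can be checked offline against a small fragment before running them on the full page. The sketch below is only an illustration: `sample_html` is a hypothetical string shaped to match the patterns, not real YouTube markup, and the length check is an added safeguard that the original cell does not perform.

```python
import re

# Same patterns as in the cell above.
title_pattern = r'"title":\{"runs":\[\{"text":"(.*?)"\}\]'
video_id_pattern = r'\{"webCommandMetadata":\{"url":"\/watch\?v=(.*?)\\u0026list'

# Hypothetical fragment written to match the patterns; not real YouTube markup.
sample_html = (
    r'"title":{"runs":[{"text":"First video"}],"x":1},'
    r'{"webCommandMetadata":{"url":"/watch?v=AAAAAAAAAAA\u0026list=PL0"}},'
    r'"title":{"runs":[{"text":"Second video"}],"x":2},'
    r'{"webCommandMetadata":{"url":"/watch?v=BBBBBBBBBBB\u0026list=PL0"}}'
)

sample_titles = re.findall(title_pattern, sample_html)
sample_ids = re.findall(video_id_pattern, sample_html)

# IDs and titles are paired by position, so unequal counts would
# silently attach titles to the wrong videos.
if len(sample_titles) != len(sample_ids):
    raise ValueError(
        f"Found {len(sample_ids)} video IDs but {len(sample_titles)} titles"
    )

for video_id, title in zip(sample_ids, sample_titles):
    print(video_id, title)
```

Because the DataFrame pairs the two lists by position, a similar length check on `titles` and `video_ids` may be worth running before building `df`.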


## Step 2. Get video transcripts

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi

transcripts = []
for _, row in df.iterrows():
    print(f"Fetching transcript for video \"{row['title']}\"")
    # The API returns a list of dicts; keep only each entry's "text" field.
    transcript_list = YouTubeTranscriptApi.get_transcript(row["video_id"])
    transcripts.append(" ".join(entry["text"] for entry in transcript_list))

df["transcript"] = transcripts



Fetching transcript for video "How turtle shells evolved... twice - Judy Cebra Thomas"
Fetching transcript for video "Why are fish fish-shaped? - Lauren Sallan"
Fetching transcript for video "The surprising reasons animals play dead - Tierney Thys"
Fetching transcript for video "Why isn't the world covered in poop? - Eleanor Slade and Paul Manning"
Fetching transcript for video "Why are sloths so slow? - Kenny Coogan"
Fetching transcript for video "The evolution of animal genitalia - Menno Schilthuizen"
Fetching transcript for video "Meet the tardigrade, the toughest animal on Earth - Thomas Boothby"
Fetching transcript for video "Why do we harvest horseshoe crab blood? - Elizabeth Cox"
Fetching transcript for video "The surprising reason birds sing - Partha P. Mitra"
Fetching transcript for video "The life cycle of the butterfly - Franziska Bauer"
Fetching transcript for video "Cannibalism in the animal kingdom - Bill Schutt"
Fetching transcript for video "A simple way to tell insects
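
The loop above stops at the first video whose transcript cannot be fetched. A minimal sketch of a more tolerant variant follows, assuming it is acceptable to record `None` for such videos. The helpers `join_segments`, `collect_transcripts`, and `fake_fetch` are hypothetical names introduced here for illustration; `fake_fetch` stands in for the real API call so the example runs without network access.

```python
from typing import Callable, Dict, Iterable, List, Optional

Segment = Dict[str, object]


def join_segments(segments: Iterable[Segment]) -> str:
    """Concatenate the "text" field of each transcript segment."""
    return " ".join(str(segment["text"]) for segment in segments)


def collect_transcripts(
    video_ids: Iterable[str],
    fetch: Callable[[str], Iterable[Segment]],
) -> List[Optional[str]]:
    """Fetch one transcript per video ID, recording None on failure."""
    transcripts: List[Optional[str]] = []
    for video_id in video_ids:
        try:
            transcripts.append(join_segments(fetch(video_id)))
        except Exception as error:  # broad on purpose: this is only a sketch
            print(f"Skipping {video_id}: {error}")
            transcripts.append(None)
    return transcripts


def fake_fetch(video_id: str) -> List[Segment]:
    """Stand-in fetcher (assumption) that fails for one made-up ID."""
    if video_id == "missing":
        raise LookupError("no transcript available")
    return [{"text": "hello"}, {"text": "world"}]


result = collect_transcripts(["ok", "missing"], fake_fetch)
print(result)
```

In the notebook, `YouTubeTranscriptApi.get_transcript` (the call used in the cell above) could be passed as `fetch`, and rows with a `None` transcript dropped afterwards. Other releases of the library may expose a different entry point, so check the installed version.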

Export the first 4, 8, 16, 32, 64, and 100 rows to separate CSV files for reuse.


In [4]:
import os
os.makedirs("./transcripts", exist_ok=True)  # to_csv does not create the directory
for i in [4, 8, 16, 32, 64, 100]:
    df[:i].to_csv(f"./transcripts/awesome_nature_{i}.csv")
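
To reuse an exported file, it can be read back with `pd.read_csv`. The sketch below shows the round trip on a hypothetical two-row sample written to a temporary directory; the sample rows and the file name `awesome_nature_sample.csv` are placeholders, not part of the dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row sample standing in for the real DataFrame.
sample = pd.DataFrame(
    {
        "video_id": ["W9wAfqBd_T0", "-64U7WoBrqM"],
        "title": ["first title", "second title"],
        "transcript": ["first transcript text", "second transcript text"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "awesome_nature_sample.csv"
    # to_csv writes the row index as an unnamed first column by default.
    sample.to_csv(path)
    # index_col=0 turns that first column back into the index on load.
    reloaded = pd.read_csv(path, index_col=0)

print(reloaded.head())
```

Without `index_col=0`, the saved index would come back as an extra `Unnamed: 0` column.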