<a href="https://colab.research.google.com/github/getcher123/YouTube-Subtitle-Extractor/blob/main/youtube_subtitle_extractor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# YouTube Subtitle Extractor

This notebook extracts English and Russian subtitles from YouTube videos, cleans the text, and saves the result to an Excel file for building a parallel corpus.

## Prerequisites

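The following packages are required to run the script:
- pandas
- re (standard library)
- os (standard library)
- itertools (standard library)
- youtube_transcript_api (install with `!pip install youtube_transcript_api`)
- google-api-python-client (install with `!pip install google-api-python-client`)
- openai (install with `!pip install openai`; used only by the embeddings cell)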

## Usage

1. Set `video_ids` to the list of video IDs and `channelId` to the YouTube channel ID for which you want to extract subtitles.
2. Run the script in your preferred Python environment.

The script will extract subtitles for the specified video IDs, clean the text, compare English and Russian subtitles to remove any discrepancies, and save the final result in an Excel file in the specified directory.
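
As a rough illustration of the clean-and-split step described above, the sketch below joins caption fragments, strips a few characters, and splits on sentence-ending punctuation followed by a capital letter. The function names `clean_fragment` and `split_sentences` are hypothetical, and the rules are a simplified subset of what the notebook cells apply.

```python
import re

def clean_fragment(text):
    """Normalize one caption fragment (simplified, illustrative rules)."""
    text = text.replace('\xa0', ' ').replace('\n', ' ')
    text = text.replace('...', ',').replace('…', ',')
    text = re.sub(r'\[.*?\]', '', text)   # drop bracketed notes such as [Music]
    text = re.sub(r'["«»]', '', text)     # drop quotation marks
    return re.sub(r'\s{2,}', ' ', text).strip()

def split_sentences(fragments):
    """Join cleaned fragments and split where ., ! or ? precedes a capital letter."""
    combined = ' '.join(clean_fragment(f) for f in fragments)
    parts = re.split(r'(?<=[!?.])\s+(?=[A-ZА-Я])', combined)
    return [p for p in parts if p]

fragments = ["Hello there.\nHow are", "you today? [Music]", "Fine..."]
print(split_sentences(fragments))
# ['Hello there.', 'How are you today?', 'Fine,']
```

Because the split relies on a capital letter after the punctuation, sentences that start with a digit or a lowercase word stay attached to the previous one.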

The saved Excel file will have two columns:
- `en` - English subtitle
- `ru` - Russian subtitle
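
The two-column layout can be sketched without any third-party package. The helper `to_rows` below is a hypothetical name; it pairs sentences by position with `itertools.zip_longest`, which is the same pairing function the notebook uses, and pads the shorter side with `None`.

```python
from itertools import zip_longest

def to_rows(sentences_en, sentences_ru):
    """Pair sentences position by position; the shorter side is padded with None."""
    return [{'en': en, 'ru': ru}
            for en, ru in zip_longest(sentences_en, sentences_ru, fillvalue=None)]

rows = to_rows(["Hello.", "How are you?"], ["Привет."])
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed to `pandas.DataFrame` to get the `en` and `ru` columns before writing the Excel file.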

## Notes

- If a video doesn't have English or Russian subtitles, it will be skipped.
- The script splits subtitles at sentence boundaries and cleans the text by removing unnecessary characters such as `...`, `“`, and `’`.
- To remove discrepancies between English and Russian subtitles, the script compares the timestamps in the subtitles and deletes any sentence that has no matching timestamp in the other language. The comparison looks only one sentence ahead, so several consecutive discrepancies may leave some sentences unpaired.
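
One way to picture timestamp-based pairing is a two-pointer merge over sentences sorted by start time. This is a different, simpler technique than the look-ahead-by-one `compareLists` function in the notebook; the name `align_by_start`, the `(start, text)` tuple layout, and the `tolerance` parameter are assumptions made for this sketch.

```python
def align_by_start(items_en, items_ru, tolerance=0.0):
    """Keep only pairs whose start times match within `tolerance` seconds.

    Each item is a (start_seconds, text) tuple, and both lists are sorted by start.
    """
    pairs = []
    i = j = 0
    while i < len(items_en) and j < len(items_ru):
        start_en, text_en = items_en[i]
        start_ru, text_ru = items_ru[j]
        if abs(start_en - start_ru) <= tolerance:
            pairs.append((text_en, text_ru))
            i += 1
            j += 1
        elif start_en < start_ru:
            i += 1  # English sentence has no partner; skip it
        else:
            j += 1  # Russian sentence has no partner; skip it
    return pairs

en = [(0.0, "Hello."), (2.5, "Extra line."), (4.0, "Bye.")]
ru = [(0.0, "Привет."), (4.0, "Пока.")]
print(align_by_start(en, ru))
# [('Hello.', 'Привет.'), ('Bye.', 'Пока.')]
```

Unlike a one-step look-ahead, the merge can skip any number of consecutive unmatched sentences, at the cost of requiring both lists to be sorted by start time.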

In [2]:
# Mount my Google Drive (storage)
from google.colab import drive
drive.mount('/content/gdrive')

# data dir
import os
data_dir = '/content/gdrive/MyDrive/subtitles'  # Your data directory in Colab 
os.listdir(data_dir)

Mounted at /content/gdrive


['snt_UCS1mEytYPPiOHtfe_zqvKWg_E21kilDE8jY.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_6NK70E9WfY0.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_TZIKanZqUtk.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_k7aGLisVvOA.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_LyPnYuawOJY.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_AxV0amhuE4s.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_K2lMMkKk1Hg.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_gkftlwflhss.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_6ttPvTHia_Y.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_lIqdaP4HSB8.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_QgnIVcTvjfg.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_qiJfTa4C6HI.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_VYBYgIvFfCI.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_NfMxbPQKr3g.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_y1497ksjLWk.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_e6AaXosgiYM.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_-UgFOrP8D8Q.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_40isgJFwMGg.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_Vb7dKjAOOQ0.xlsx',
 'snt_UCS1mEytYPPiOHtfe_zqvKWg_8b_xmMGCiPg.xlsx',


In [3]:
# install libraries
!pip install google-api-python-client
!pip install youtube_transcript_api
!pip install openai
# importing libraries 
import json
from googleapiclient.discovery import build # Google API request
from youtube_transcript_api import YouTubeTranscriptApi
import os
import pandas as pd
import re
from itertools import zip_longest

# Enter your YouTube API key (keep real keys out of shared notebooks)
api_key = 'YOUR_YOUTUBE_API_KEY'
# If the video URL is https://youtu.be/zOjov-2OZ0E, the video ID is the last part: "zOjov-2OZ0E".

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting youtube_transcript_api
  Downloading youtube_transcript_api-0.6.0-py3-none-any.whl (23 kB)
Installing collected packages: youtube_transcript_api
Successfully installed youtube_transcript_api-0.6.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.27.6-py3-none-any.whl (71 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting aiosignal>=1.1.2
  Downloading aiosignal-1.3.1-py3-none-any
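
The setup cell notes that the video ID is the last part of a `youtu.be` URL. A minimal sketch of pulling the ID out of a URL with the standard library follows; `extract_video_id` is a hypothetical helper, and it only handles the short `youtu.be/<id>` form and the `watch?v=<id>` form.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video ID from a youtu.be or watch?v= URL, or None if not found."""
    parsed = urlparse(url)
    if parsed.netloc.endswith('youtu.be'):
        # Short links carry the ID as the path: https://youtu.be/<id>
        return parsed.path.lstrip('/') or None
    # Long links carry the ID in the "v" query parameter
    query = parse_qs(parsed.query)
    if 'v' in query:
        return query['v'][0]
    return None

print(extract_video_id("https://youtu.be/zOjov-2OZ0E"))
print(extract_video_id("https://www.youtube.com/watch?v=zOjov-2OZ0E&t=10s"))
```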

# Get all video IDs from the channel

In [None]:
channelId = "UCBobmJyzsJ6Ll7UbfhI4iwQ"
youtube = build('youtube','v3',developerKey= api_key)

# getting all video details
contentdata = youtube.channels().list(id=channelId,part='contentDetails').execute()
playlist_id = contentdata['items'][0]['contentDetails']['relatedPlaylists']['uploads']
videos = []
next_page_token = None

while True:
    res = youtube.playlistItems().list(playlistId=playlist_id,part='snippet',maxResults=50,pageToken=next_page_token).execute()
    videos += res['items']
    next_page_token = res.get('nextPageToken')
    if next_page_token is None:
        break

# getting the video ID of each playlist item
video_ids = [item['snippet']['resourceId']['videoId'] for item in videos]
# video_ids = video_ids[:4]
video_ids
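
The pagination pattern used in the cell above can be tried offline. In this sketch `fetch_page` is a hypothetical callable standing in for `youtube.playlistItems().list(...).execute()`, and the fake pages only mimic the `items` and `nextPageToken` keys of a response.

```python
def collect_all_items(fetch_page):
    """Follow nextPageToken until the response no longer contains one.

    `fetch_page` takes a page token (or None for the first page) and returns
    a dict with an 'items' list and, optionally, a 'nextPageToken'.
    """
    items = []
    token = None
    while True:
        response = fetch_page(token)
        items.extend(response.get('items', []))
        token = response.get('nextPageToken')
        if token is None:
            return items

# Stand-in for the real API: two pages of fake results.
fake_pages = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'page2'},
    'page2': {'items': ['c']},
}
print(collect_all_items(fake_pages.__getitem__))
# ['a', 'b', 'c']
```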

# Filter video IDs by language and save to Google Drive

In [None]:
def checkLanguages(video_id):
    """Return True if the video has both English and Russian subtitles."""
    print(f'Checking {video_id}')
    try:
        YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        YouTubeTranscriptApi.get_transcript(video_id, languages=['ru'])
    except Exception:
        return False
    print(f'Languages found in {video_id}')
    return True

# Keep only the videos that have both languages
video_ids = [el for el in video_ids if checkLanguages(el)]

# Create the channel directory if it doesn't exist
directory = os.path.join(data_dir, channelId)
os.makedirs(directory, exist_ok=True)

file_path = os.path.join(directory, 'video_list.xlsx')
df = pd.DataFrame(video_ids, columns=['video_ids'])
df.to_excel(file_path, index=False)

print(f"List saved as {file_path}")
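
The language filter can be exercised without network access by injecting the fetch function. In the sketch below, `has_all_languages` and `fake_fetch` are hypothetical names; in the notebook the real call is `YouTubeTranscriptApi.get_transcript(video_id, languages=[...])`, which could be wrapped in a small function and passed in the same way.

```python
def has_all_languages(video_id, fetch_transcript, languages=('en', 'ru')):
    """Return True only if a transcript can be fetched for every language."""
    for lang in languages:
        try:
            fetch_transcript(video_id, [lang])
        except Exception as exc:  # the real library raises its own exception types
            print(f'{video_id}: no {lang} transcript ({type(exc).__name__})')
            return False
    return True

# Stand-in data: vid1 has both languages, vid2 has English only.
available = {('vid1', 'en'), ('vid1', 'ru'), ('vid2', 'en')}

def fake_fetch(video_id, languages):
    if (video_id, languages[0]) not in available:
        raise LookupError('transcript not found')
    return []

kept = [v for v in ['vid1', 'vid2'] if has_all_languages(v, fake_fetch)]
print(kept)
# ['vid1']
```

Printing the exception type makes it easier to tell a missing transcript from an unrelated failure such as a network error.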

# Load video IDs from Google Drive

In [7]:
import pandas as pd

file_path = "/content/gdrive/MyDrive/subtitles/UCBobmJyzsJ6Ll7UbfhI4iwQ/video_list.xlsx"
channelId = "Epic"

# Read the Excel file into a pandas DataFrame
df = pd.read_excel(file_path)

# Convert the DataFrame column to a list
video_ids = df['video_ids'].tolist()

print(video_ids)

['h_dJtk3BCyg', 'A1Nvl2MD30U', 'alQEf454PjU', 'r1fHOS4XaeE', 'Wc6lUXOhRO0', 'f9q8A-9DvPo', 'u06GAVxyIag', 'Fj1zCsYydD8', 'mUCitodzZxI', 'Dmh5a_ddO58', 'QXuHzH0IyRE', 'Itd677YZi50', 'usJrcwN6T4I', 'xLVJP-o0g28', '7ZLibi6s_ew', '9qh8HPHjCAw', 'WU0gvPcc3jQ']


# Get subtitles line by line

In [None]:
# list to hold all prompts
prompts = []

# iterate over each video ID
for video_id in video_ids:

    try:
        subtitles_en = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        subtitles_ru = YouTubeTranscriptApi.get_transcript(video_id, languages=['ru'])
    except Exception as e:
        print(f"Exception occurred: {e}")
        continue
    # collect the raw subtitle lines of each language separately,
    # so that a difference in line counts is not hidden by zip()
    list_sub_en = [sub['text'] for sub in subtitles_en]
    list_sub_ru = [sub['text'] for sub in subtitles_ru]
    # pair the lines; the shorter list is padded with None
    data = list(zip_longest(list_sub_en, list_sub_ru, fillvalue=None))
    df = pd.DataFrame(data, columns=['en', 'ru'])
    # flag files whose line counts differ
    sufix = ""
    if len(list_sub_en) != len(list_sub_ru):
        sufix = f"!error_{len(list_sub_en)}-{len(list_sub_ru)}_"
    with pd.ExcelWriter(os.path.join(data_dir, f'{sufix}lines_{channelId}_{video_id}.xlsx')) as writer:
        df.to_excel(writer, index=False)
    
#sentences = combined_sub.split('\n')
#combined_subtitles.extend(sentences)


# create dataframe from prompts list
#df = pd.DataFrame(prompts, columns=['en', 'ru'])

# save dataframe to Excel file
#with pd.ExcelWriter(os.path.join(data_dir, f'{channelId}.xlsx')) as writer:
#    df.to_excel(writer, index=False)
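
The file-naming convention used when saving can be isolated in a small helper. `output_name` is a hypothetical function written for this sketch; the `!error_<n_en>-<n_ru>_` prefix and the `<kind>_<channel>_<video>.xlsx` pattern follow the f-strings in the cell above.

```python
def output_name(kind, channel_id, video_id, n_en, n_ru):
    """Build the Excel file name, flagging a mismatch between the two counts."""
    prefix = '' if n_en == n_ru else f'!error_{n_en}-{n_ru}_'
    return f'{prefix}{kind}_{channel_id}_{video_id}.xlsx'

print(output_name('lines', 'Epic', 'h_dJtk3BCyg', 10, 10))
# lines_Epic_h_dJtk3BCyg.xlsx
print(output_name('snt', 'Epic', 'h_dJtk3BCyg', 2, 1))
# !error_2-1_snt_Epic_h_dJtk3BCyg.xlsx
```

The leading `!` sorts flagged files to the top of a directory listing, which makes mismatched videos easy to spot.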

# Get subtitles grouped by sentences with timecodes

In [6]:
import pandas as pd
import re
import os
from itertools import zip_longest

def extract_numbers(s):
    pattern = r'\[(\d+\.\d+)\]'
    numbers = re.findall(pattern, s)
    cleaned_s = re.sub(pattern, '', s)
    return numbers, cleaned_s

def compareLists(list1, list2):
  i = 0
  while i < min(len(list1), len(list2)):
    numbers1_1, cleaned_s1_1 = extract_numbers(list1[i])
    numbers2_1, cleaned_s2_1 = extract_numbers(list2[i])
    if numbers1_1 == numbers2_1:
        list1[i] = cleaned_s1_1
        list2[i] = cleaned_s2_1
        i += 1
        continue
        
    # Timestamps differ: look one sentence ahead on each side to resynchronize.
    numbers1_2 = extract_numbers(list1[i+1])[0] if i + 1 < len(list1) else ""
    numbers2_2 = extract_numbers(list2[i+1])[0] if i + 1 < len(list2) else ""
    if numbers1_1 == numbers2_2:
        del list2[i]  # list2[i] has no partner in list1
    elif numbers2_1 == numbers1_2:
        del list1[i]  # list1[i] has no partner in list2
    else:
        del list1[i]
        del list2[i]
  return list1, list2

# list to hold all prompts
prompts = []

# iterate over each video ID
for video_id in video_ids:
    print(video_id)

    try:
        subtitles_en = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        subtitles_ru = YouTubeTranscriptApi.get_transcript(video_id, languages=['ru'])
    except Exception:
        # skip video if it doesn't have English or Russian subtitles
        continue
    combined_sub_en = ""
    combined_sub_ru = ""
    # iterate over each subtitle entry and add to prompts list
    for sub_en, sub_ru in zip(subtitles_en, subtitles_ru):
        text_en = sub_en['text']
        time_en = sub_en['start']

        text_ru = sub_ru['text']
        time_ru = sub_ru['start']

        text_en = re.split(r'(?<=[!?.])+(?=[A-ZА-Я])', text_en)
        text_en = " ".join(text_en)
        text_ru = re.split(r'(?<=[!?.])+(?=[A-ZА-Я])', text_ru)
        text_ru = " ".join(text_ru)
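        # strip subtitle position tags such as {\an8}; the backslash must be escaped,
        # because \a inside a regular expression matches the bell character
        text_en = re.sub(r'\{\\an\d+\}\s*', '', text_en)
        text_ru = re.sub(r'\{\\an\d+\}\s*', '', text_ru)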
        text_en = re.sub(r'\[.*?\]', '', text_en)
        text_ru = re.sub(r'\[.*?\]', '', text_ru)
        text_en = re.sub(r'[A-ZА-Я ]+:', '', text_en)
        text_ru = re.sub(r'[A-ZА-Я ]+:', '', text_ru)

        combined_sub_en += (text_en.replace('\xa0', ' ').replace('\n', ' ').replace('...', ',').replace('…', ',').replace('"', '').replace('\'', '') + ' ').replace('  ', ' ').replace('  ', ' ') + f"[{time_en}]"
        combined_sub_ru += (text_ru.replace('\xa0', ' ').replace('\n', ' ').replace('...', ',').replace('…', ',').replace('«', '').replace('»', '').replace('"', '') + ' ').replace('  ', ' ').replace('  ', ' ') + f"[{time_ru}]"

#        sentences_sub_en = re.split(r'(?<![!?\.])[!?.]\s', combined_sub_en)
#        sentences_sub_ru = re.split(r'(?<![!?\.])[!?.]\s', combined_sub_ru)

    sentences_sub_en = re.split(r'(?<=[!?.])\s+(?=[A-ZА-Я])', combined_sub_en)
    sentences_sub_ru = re.split(r'(?<=[!?.])\s+(?=[A-ZА-Я])', combined_sub_ru)
    
#    sentences_sub_en, sentences_sub_ru = compareLists(sentences_sub_en, sentences_sub_ru)


    sentences_sub_en = [elem for elem in sentences_sub_en if elem]
    sentences_sub_ru = [elem for elem in sentences_sub_ru if elem]

    print(video_id)
    print(len(sentences_sub_en))
    print(len(sentences_sub_ru))
    print("------------------")

    # pair the sentences; the shorter list is padded with None
    data = list(zip_longest(sentences_sub_en, sentences_sub_ru, fillvalue=None))
    df = pd.DataFrame(data, columns=['en', 'ru'])
    # flag files whose sentence counts differ
    sufix = ""
    if len(sentences_sub_en) != len(sentences_sub_ru):
        sufix = f"!error_{len(sentences_sub_en)}-{len(sentences_sub_ru)}_"
    with pd.ExcelWriter(os.path.join(data_dir, f'{sufix}snt_{channelId}_{video_id}.xlsx')) as writer:
        df.to_excel(writer, index=False)
    
#sentences = combined_sub.split('\n')
#combined_subtitles.extend(sentences)


# create dataframe from prompts list
#df = pd.DataFrame(prompts, columns=['en', 'ru'])

# save dataframe to Excel file
#with pd.ExcelWriter(os.path.join(data_dir, f'{channelId}.xlsx')) as writer:
#    df.to_excel(writer, index=False)

h_dJtk3BCyg
h_dJtk3BCyg
2
1
------------------


NameError: ignored
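
The timecode tagging used in this cell can be checked in isolation. The sketch below mirrors the `[start]` tag format and the `\[(\d+\.\d+)\]` pattern from `extract_numbers`; the names `tag` and `untag` are hypothetical, and converting the start time to `float` is an assumption made so that whole-second values still contain a decimal point and match the pattern.

```python
import re

TIMECODE = re.compile(r'\[(\d+\.\d+)\]')

def tag(text, start):
    """Append the start time in square brackets, always with a decimal point."""
    return f'{text}[{float(start)}]'

def untag(s):
    """Return the list of start times found in s and s without the tags."""
    starts = [float(m) for m in TIMECODE.findall(s)]
    return starts, TIMECODE.sub('', s)

tagged = tag('Hello there. ', 1.5) + tag('How are you?', 3.25)
print(tagged)
# Hello there. [1.5]How are you?[3.25]
print(untag(tagged))
# ([1.5, 3.25], 'Hello there. How are you?')
```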

# Get subtitles grouped by sentences with embeddings

In [22]:
import pandas as pd
import re
import os
import openai
from scipy.spatial.distance import cosine
from openai.embeddings_utils import get_embedding
from itertools import zip_longest


def get_text_similarity(text1, text2):
    openai.api_key = "YOUR_OPENAI_API_KEY"  # Replace with your own OpenAI API key
    model_engine = "text-embedding-ada-002"  # Change to the desired OpenAI embedding model

    # Generate embeddings for the two pieces of text using OpenAI's language model
    embeddings1 = get_embedding(text1, model_engine)
    embeddings2 = get_embedding(text2, model_engine)

    # Calculate the cosine distance between the two embeddings using Scipy's cosine distance function
    similarity = 1 - cosine(embeddings1, embeddings2)

    return similarity

def compareLists(list1, list2):
  i = 0
  while i < min(len(list1), len(list2)):
    print(i)
    similarity = get_text_similarity(list1[i], list2[i])
    if similarity > 0.81:
        i += 1
        continue
    
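    # Not similar enough: compare each sentence with the next one on the other side.
    if i < min(len(list1), len(list2)):
        try:
          similarity1 = get_text_similarity(list1[i], list2[i+1])
        except IndexError:  # list2 has no next sentence
          similarity1 = 0
        try:
          similarity2 = get_text_similarity(list1[i+1], list2[i])
        except IndexError:  # list1 has no next sentence
          similarity2 = 0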
        if similarity1 > 0.81:
            del list2[i]
        elif similarity2 > 0.81:
            del list1[i]
        else:
            del list1[i]
            del list2[i]
  return list1, list2

# list to hold all prompts
prompts = []
video_ids = ["WU0gvPcc3jQ"]
channelId = "channel"


# iterate over each video ID
for video_id in video_ids:
    print(video_id)

    try:
        subtitles_en = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])
        subtitles_ru = YouTubeTranscriptApi.get_transcript(video_id, languages=['ru'])
    except Exception:
        # skip video if it doesn't have English or Russian subtitles
        continue
    combined_sub_en = ""
    combined_sub_ru = ""
    # iterate over each subtitle entry and add to prompts list
    for sub_en, sub_ru in zip(subtitles_en, subtitles_ru):
        text_en = sub_en['text']
        time_en = sub_en['start']

        text_ru = sub_ru['text']
        time_ru = sub_ru['start']

        text_en = re.split(r'(?<=[!?.])+(?=[A-ZА-Я])', text_en)
        text_en = " ".join(text_en)
        text_ru = re.split(r'(?<=[!?.])+(?=[A-ZА-Я])', text_ru)
        text_ru = " ".join(text_ru)
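        # strip subtitle position tags such as {\an8}; the backslash must be escaped,
        # because \a inside a regular expression matches the bell character
        text_en = re.sub(r'\{\\an\d+\}\s*', '', text_en)
        text_ru = re.sub(r'\{\\an\d+\}\s*', '', text_ru)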
        text_en = re.sub(r'\[.*?\]', '', text_en)
        text_ru = re.sub(r'\[.*?\]', '', text_ru)
        text_en = re.sub(r'[A-ZА-Я ]+:', '', text_en)
        text_ru = re.sub(r'[A-ZА-Я ]+:', '', text_ru)
        text_en = text_en.replace('Hi!', 'Hi,')
        text_ru = text_ru.replace('Привет!', 'Привет,')
        text_en = text_en.replace('Hi.', 'Hi,')
        text_ru = text_ru.replace('Привет.', 'Привет,')
        combined_sub_en += (text_en.replace('\xa0', ' ').replace('\n', ' ').replace('...', ',').replace('…', ',').replace('"', '').replace('\'', '') + ' ').replace('  ', ' ').replace('  ', ' ')
        combined_sub_ru += (text_ru.replace('\xa0', ' ').replace('\n', ' ').replace('...', ',').replace('…', ',').replace('«', '').replace('»', '').replace('"', '') + ' ').replace('  ', ' ').replace('  ', ' ')

#        sentences_sub_en = re.split(r'(?<![!?\.])[!?.]\s', combined_sub_en)
#        sentences_sub_ru = re.split(r'(?<![!?\.])[!?.]\s', combined_sub_ru)

    sentences_sub_en = re.split(r'(?<=[!?.])\s+(?=[A-ZА-Я])', combined_sub_en)
    sentences_sub_ru = re.split(r'(?<=[!?.])\s+(?=[A-ZА-Я])', combined_sub_ru)
    
    sentences_sub_en, sentences_sub_ru = compareLists(sentences_sub_en, sentences_sub_ru)

    sentences_sub_en = [elem for elem in sentences_sub_en if elem]
    sentences_sub_ru = [elem for elem in sentences_sub_ru if elem]

    print(video_id)
    print(len(sentences_sub_en))
    print(len(sentences_sub_ru))
    print("------------------")

    # pair the sentences; the shorter list is padded with None
    data = list(zip_longest(sentences_sub_en, sentences_sub_ru, fillvalue=None))
    df = pd.DataFrame(data, columns=['en', 'ru'])
    # flag files whose sentence counts differ
    sufix = ""
    if len(sentences_sub_en) != len(sentences_sub_ru):
        sufix = f"!error_{len(sentences_sub_en)}-{len(sentences_sub_ru)}_"
    with pd.ExcelWriter(os.path.join(data_dir, f'{sufix}snt_{channelId}_{video_id}.xlsx')) as writer:
        df.to_excel(writer, index=False)
    
#sentences = combined_sub.split('\n')
#combined_subtitles.extend(sentences)


# create dataframe from prompts list
#df = pd.DataFrame(prompts, columns=['en', 'ru'])

# save dataframe to Excel file
#with pd.ExcelWriter(os.path.join(data_dir, f'{channelId}.xlsx')) as writer:
#    df.to_excel(writer, index=False)

WU0gvPcc3jQ
0
1
2
3
4
5
6
6
7
8
9
10
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
25
25
26
27
28
29
30
31
32
33
34
35
35
36
37
38
39
39
40
40
41
41
42
42
43
43
44
45
46
47
48
49
50
51
52
53
54
55
55
56
57
58
58
59
59
60
61
61
61
62
WU0gvPcc3jQ
62
63
------------------
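
The similarity test behind the embedding-based comparison is the cosine of the angle between two vectors. The sketch below computes it in pure Python on toy two-dimensional vectors instead of real embeddings; `cosine_similarity` and `is_match` are hypothetical names, and the default threshold of 0.81 is the value used in the cell above.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_match(vec_a, vec_b, threshold=0.81):
    """Treat two sentences as a pair when their vectors are similar enough."""
    return cosine_similarity(vec_a, vec_b) > threshold

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(is_match([1.0, 1.0], [1.0, 0.9]))           # nearly the same direction
```

The threshold is an empirical choice; it would likely need retuning for a different embedding model or language pair.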



# Further test cells (not needed)

In [None]:
list1 = ['apple', 'banana [1.1]', 'cherry[2.1]',  'banana1', 'banana2',       'apple1[1.23]',  'banana1[12.2]', 'banana5[1.22]']
list2 = ['apple', 'banana [1.1]', 'cherry[2.2]',             'banana3',       'apple2[1.23]',  'banana2[12.3]', 'banana2', 'banana6[1.22]']

def compareLists(list1, list2):
  i = 0
  while i < min(len(list1), len(list2)):
    numbers1_1, cleaned_s1_1 = extract_numbers(list1[i])
    numbers2_1, cleaned_s2_1 = extract_numbers(list2[i])
    print(i)
    if numbers1_1 == numbers2_1:
        print(f"{list1[i] = } {list2[i] = }")
        list1[i] = cleaned_s1_1
        list2[i] = cleaned_s2_1
        i += 1
        continue
        
    del list1[i]
    del list2[i]
    if i < min(len(list1), len(list2)) :
        numbers1_1, cleaned_s1_1 = extract_numbers(list1[i])
        numbers2_1, cleaned_s2_1 = extract_numbers(list2[i])
        try:
          numbers1_2, cleaned_s1_2 = extract_numbers(list1[i+1])
        except:
          numbers1_2 = ""
        try:
          numbers2_2, cleaned_s2_2 = extract_numbers(list2[i+1])
        except:
          numbers2_2 = ""
        if numbers1_1 == numbers2_2:
            del list2[i]
        elif numbers2_1 == numbers1_2:
            del list1[i]
  return list1, list2
        
list1, list2 = compareLists(list1, list2)

print(list1)
print(list2)

0
list1[i] = 'apple' list2[i] = 'apple'
1
list1[i] = 'banana [1.1]' list2[i] = 'banana [1.1]'
2
2
list1[i] = 'banana2' list2[i] = 'banana3'
3
list1[i] = 'apple1[1.23]' list2[i] = 'apple2[1.23]'
4
4
list1[i] = 'banana5[1.22]' list2[i] = 'banana6[1.22]'
['apple', 'banana ', 'banana2', 'apple1', 'banana5']
['apple', 'banana ', 'banana3', 'apple2', 'banana6']


0.7760193696146548
