# How to download YouTube Subtitles for Semantic Brand Score Analysis

**The main steps in this notebook are:**

* Download the subtitles file(s)
* Convert to a text file
* Clean the text and remove duplications
* Export to a text file


**Subtitle Settings available via youtube-dl:**

* --write-sub (Write the subtitle file)
* --write-auto-sub (Write the automatically generated subtitle file; YouTube only)
* --all-subs (Download all available subtitles for the video)
* --list-subs (List all available subtitles for the video)
* --sub-format FORMAT (Subtitle format; accepts a preference list, for example: "srt" or "ass/srt/best")
* --sub-lang LANGS (Languages of the subtitles to download, optional, separated by commas; use IETF language tags like 'en,pt')
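
When youtube-dl is called from Python instead of the command line, these flags are passed as keys of an options dictionary. The sketch below lists the keys that correspond to each flag in the `youtube_dl.YoutubeDL` options. The helper `build_subtitle_opts` and its parameter names are hypothetical, introduced here only for illustration.

```python
# Command-line flag -> key in the youtube_dl.YoutubeDL options dictionary
CLI_TO_OPTION_KEY = {
    "--write-sub": "writesubtitles",
    "--write-auto-sub": "writeautomaticsub",
    "--all-subs": "allsubtitles",
    "--list-subs": "listsubtitles",
    "--sub-format": "subtitlesformat",
    "--sub-lang": "subtitleslangs",
}


def build_subtitle_opts(langs=("en",), sub_format="vtt", include_auto=True):
    """Hypothetical helper: build options that fetch subtitles only."""
    return {
        "writesubtitles": True,             # --write-sub
        "writeautomaticsub": include_auto,  # --write-auto-sub
        "subtitleslangs": list(langs),      # --sub-lang en,pt -> ["en", "pt"]
        "subtitlesformat": sub_format,      # --sub-format
        "skip_download": True,              # do not download the video itself
    }


opts = build_subtitle_opts(langs=("en", "pt"))
print(opts)
```

The resulting dictionary can be passed to `youtube_dl.YoutubeDL(opts)`. Note that the language list is a Python list, not a comma-separated string.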

In [None]:
# Install the youtube-dl module in your Colab runtime
!pip install youtube-dl

Collecting youtube-dl
  Downloading youtube_dl-2021.6.6-py2.py3-none-any.whl (1.9 MB)
     |████████████████████████████████| 1.9 MB 22.4 MB/s 
Installing collected packages: youtube-dl
Successfully installed youtube-dl-2021.6.6


# Insert YouTube URL and Subtitles Settings

In [None]:
from __future__ import unicode_literals
import youtube_dl

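ydl_opts = {
    'writesubtitles': True,       # same as --write-sub
    'writeautomaticsub': True,    # same as --write-auto-sub
    'subtitleslangs': ['en'],     # same as --sub-lang en
    'skip_download': True,        # fetch the subtitles only, not the video
}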
url = input("Please add YouTube link here:")
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])
print("Download Successful!")

Please add YouTube link here:https://www.youtube.com/watch?v=bkEKImUIg60
[youtube] bkEKImUIg60: Downloading webpage
[info] Writing video subtitles to: Crypto News - Polygon, ETH, Evergrande, Crypto CEOs & MORE!!-bkEKImUIg60.en.vtt
Download Successful!


With `writeautomaticsub` enabled, youtube-dl can also extract YouTube's automatically generated subtitles.
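
The log above shows that youtube-dl names the subtitle file after the video title and ID, so the path can be looked up programmatically rather than copied by hand. A minimal sketch, assuming the file was written to the working directory (`/content` on Colab) and that the newest `.vtt` file there is the one just downloaded. `find_latest_vtt` is a hypothetical helper, not part of youtube-dl.

```python
from pathlib import Path


def find_latest_vtt(directory="."):
    """Return the most recently modified .vtt file in `directory`, or None."""
    candidates = sorted(
        Path(directory).glob("*.vtt"),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


# "/content" is the Colab working directory used in this notebook;
# fall back to the current directory when running elsewhere.
search_dir = "/content" if Path("/content").is_dir() else "."
subtitle_path = find_latest_vtt(search_dir)
print(subtitle_path)
```

If several videos were downloaded in the same session, check the printed path before using it, since only the most recent file is returned.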


# Convert Subtitles file to Text File

In [None]:
#https://github.com/glut23/webvtt-py
!pip install webvtt-py
import webvtt

Collecting webvtt-py
  Downloading webvtt_py-0.4.6-py3-none-any.whl (16 kB)
Installing collected packages: webvtt-py
Successfully installed webvtt-py-0.4.6


In [None]:
#Copy the path of the downloaded subtitles file here
file = "/content/Crypto News - Polygon, ETH, Evergrande, Crypto CEOs & MORE!!-bkEKImUIg60.en.vtt"
#Read the webvtt Subtitles file
vtt = webvtt.read(file) 
# Preview the text of the first 10 captions
for caption in vtt.captions[:10]:
    print(caption.text)

 
[Music]
[Music]
 
[Music]
welcome to the coin bureau weekly crypto
welcome to the coin bureau weekly crypto
 
welcome to the coin bureau weekly crypto
review here are this week's top
review here are this week's top
 
review here are this week's top
headlines in the crypto news
headlines in the crypto news
 
headlines in the crypto news
[Music]
[Music]
 


# Remove duplicates from Subtitles

If you downloaded the automated subtitles, the output above should contain duplicates, since the same line appears in several consecutive captions.

The code below removes these consecutive duplicates. If there are none, it simply joins the text into a single paragraph.
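
To see the effect on a small sample first, the sketch below applies the same rule (drop a line when it equals the line immediately before it) to a few lines copied from the caption output above. It uses `itertools.groupby` from the standard library as a compact alternative to an explicit loop.

```python
from itertools import groupby

# Sample lines copied from the caption output above
sample_lines = [
    "[Music]",
    "[Music]",
    "welcome to the coin bureau weekly crypto",
    "welcome to the coin bureau weekly crypto",
    "review here are this week's top",
    "review here are this week's top",
    "headlines in the crypto news",
]

# groupby collapses each run of identical adjacent items into one group;
# keeping only the group key leaves one copy per run.
deduplicated = [line for line, _ in groupby(sample_lines)]

print(" ".join(deduplicated))
# prints: [Music] welcome to the coin bureau weekly crypto review here are this week's top headlines in the crypto news
```

Only adjacent repeats are removed, so a phrase that the speaker genuinely says again later in the video is kept.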

In [None]:
# Convert YouTube's auto-generated subtitles into a plain transcript.
# Source: https://stackoverflow.com/questions/51784232/how-do-i-convert-the-webvtt-format-to-plain-text

# import webvtt
# vtt = webvtt.read('subtitles.vtt')

lines = []
for caption in vtt:
    # Strip surrounding whitespace, then split multi-line captions
    # so that every subtitle line becomes its own list entry.
    lines.extend(caption.text.strip().splitlines())

# Keep a line only if it differs from the line immediately before it
deduplicated = []
previous = None
for line in lines:
    if line == previous:
        continue
    deduplicated.append(line)
    previous = line

# Join with single spaces; this avoids the leading space that
# building the string with repeated "+=" would leave behind.
transcript = " ".join(deduplicated)

print(transcript)
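
The transcript can still contain non-speech annotations such as `[Music]`, which add nothing to a text analysis. A minimal sketch of one way to strip them, assuming every annotation is wrapped in square brackets. `strip_sound_tags` is a hypothetical helper, and the sample string is assembled from the caption output shown earlier.

```python
import re


def strip_sound_tags(text):
    """Remove bracketed tags such as [Music] and collapse extra spaces."""
    without_tags = re.sub(r"\[[^\]]*\]", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = " [Music] welcome to the coin bureau weekly crypto review [Music] here are"
print(strip_sound_tags(sample))
# prints: welcome to the coin bureau weekly crypto review here are
```

This pattern removes any bracketed text, so skip it if the subtitles use square brackets for content you want to keep.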

# Save to Text File

In [None]:
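# Edit the file name below before running.
# The "with" block closes the file automatically, even if writing fails.
with open("Text_File_Name_Edit_Here.txt", "w", encoding="utf-8") as text_file:
    text_file.write(transcript)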