# Convert video subtitles to text

Created by: Derek Robinson <br>
Last Updated: February 5, 2021

### Learning Objectives

In this Jupyter Notebook you will learn or review how to:
<ul>
    <li>use the SRTtools library to extract text from a subtitle/captions document,</li>
    <li>read the number of elements in a list,</li>
    <li>use a 'for' loop,</li>
    <li>concatenate strings,</li>
    <li>write text to a file.</li>
</ul>

### Problem Statement

In this notebook you will convert the subtitle file from recorded lecture content into a readable transcript. The transcript will not contain paragraph structure, but it will be legible and provide a written version of the recorded voice-over.

### Let's get started

For this notebook we need only the `SRTtools` package. First try loading the package into your R library. If you get an error, uncomment the install command and run the cell to install the package.

In [2]:
#install.packages("SRTtools")
library(SRTtools)

In the next block of code, provide the filename of your `.srt` subtitle file (`srtFile`) and the name of the output file (`outputFilename`). <br>
Precondition: the `.srt` file is located in the same folder as this notebook. <br>
Postcondition: after the notebook is run, the output file will contain all the text from the `.srt` file.

In [3]:
srtFile <- "LocalStatistics.mp4.srt"
outputFilename <- "transcription.txt"
subtitles <- srt.read(srtFile, encoding = "utf-8")
#print(subtitles)
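
The precondition above can be checked before reading the file. The following is a minimal sketch using only base R. The variable `srtFound` is a name introduced here for illustration and is not part of the original notebook.

```r
# Assumed filename, matching the cell above; change it to your own .srt file
srtFile <- "LocalStatistics.mp4.srt"

# file.exists() returns TRUE when the file is in the current working directory
srtFound <- file.exists(srtFile)

if (!srtFound) {
  # Report where R looked, which helps when the notebook runs from another folder
  message("Could not find '", srtFile, "' in ", getwd())
}
```

If the message appears, moving the `.srt` file next to the notebook (or changing `srtFile` to a full path) should resolve it.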

The `srt.content()` function extracts all the text content from the captions file, and places each string of text associated with a specific timestamp into an element within a list. In this case we will call that list `srtContent`. 

In [4]:
srtContent <- srt.content(subtitles)
#print(srtContent)
#print(srtContent[1])
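
To make the idea of "text content" concrete, here is a rough base R sketch of what separating caption text from the rest of a subtitle file can look like. It is not how `SRTtools` is implemented. The sample lines are invented for illustration, and the filtering rule is a simplification: for example, a caption consisting only of digits would be dropped by mistake.

```r
# Invented sample: two subtitle blocks (index, timestamp, text, blank line)
sampleLines <- c(
  "1",
  "00:00:01,000 --> 00:00:04,000",
  "Welcome to the lecture",
  "",
  "2",
  "00:00:04,500 --> 00:00:07,000",
  "on local statistics.",
  ""
)

# Flag the lines that are not caption text
isIndex     <- grepl("^[0-9]+$", sampleLines)          # block counters
isTimestamp <- grepl("-->", sampleLines, fixed = TRUE) # timing lines
isBlank     <- sampleLines == ""                       # block separators

# Keep everything else as caption text
sampleContent <- sampleLines[!(isIndex | isTimestamp | isBlank)]

print(sampleContent)
# length() reports how many caption strings were kept
print(length(sampleContent))
```

The same `length()` call works on `srtContent` to see how many caption strings were extracted from your own file.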

Now that we have lines of text, we can loop through all the elements in the list and concatenate the text using the `paste()` function. Each element in the list holds data of `chr` type, so we pass it through `toString()` to make sure we are pasting a single string. We also separate the strings with a blank space, `sep=" "`, so that words that should have a space between them are not joined together.

In [5]:
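# Start from an empty string; the loop fails if srtTextJoin does not exist yet
srtTextJoin <- ""
for (i in seq_along(srtContent)) {
  # Append the next caption, separated from the previous text by one space
  srtTextJoin <- paste(srtTextJoin, toString(srtContent[i]), sep = " ")
}
# Drop the leading space left by the first paste onto the empty string
srtTextJoin <- trimws(srtTextJoin)
#print(srtTextJoin)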
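
As a side note, `paste()` can also join a whole character vector in one call through its `collapse` argument. The sketch below compares that with a loop on a small invented vector, `sampleContent`, which stands in for the real caption text. It assumes the captions are held in a plain character vector.

```r
# Invented stand-in for the extracted caption text
sampleContent <- c("Welcome to the lecture", "on local statistics.")

# Loop version: build the string one element at a time
loopJoin <- ""
for (i in seq_along(sampleContent)) {
  loopJoin <- paste(loopJoin, sampleContent[i], sep = " ")
}
# The first paste onto "" leaves a leading space, so trim it
loopJoin <- trimws(loopJoin)

# Vectorized version: collapse = " " joins all elements with single spaces
collapseJoin <- paste(sampleContent, collapse = " ")

# Both approaches give the same string
print(identical(loopJoin, collapseJoin))
```

The loop is useful for practicing iteration, while the `collapse` form is shorter once the idea is familiar.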


Lastly, we create a new document as a transcript of the `.srt` file.

In [6]:
srt.write(srtTextJoin, outputFilename)
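
For comparison, writing text to a file can also be done with base R alone. This is a sketch of that alternative and not a description of what `srt.write()` does internally. It writes an invented sentence to a temporary file (so nothing in your folder is overwritten) and reads it back to confirm the round trip.

```r
# Invented transcript text standing in for srtTextJoin
transcriptText <- "Welcome to the lecture on local statistics."

# tempfile() gives a throwaway path, so no existing file is touched
tmpFile <- tempfile(fileext = ".txt")

# writeLines() writes each element of a character vector as one line
writeLines(transcriptText, tmpFile)

# Read the file back to check that the text was saved as expected
readBack <- readLines(tmpFile)
print(identical(readBack, transcriptText))
```

To keep the result, replace `tmpFile` with a filename such as the `outputFilename` defined earlier.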

# Congratulations! 

**You have reached the end of the subtitles-to-text notebook**