## Getting text data from YouTube videos

### Overview

This workbook provides a short illustration of how to download the transcripts for a series of YouTube videos, store them in pandas dataframes, and run a few queries to show the kind of analysis you may be interested in doing in your own research.

### Software

This workbook uses the following modules.

* Pandas
* PandaSQL
* youtube_transcript_api

### Background

In my consulting sessions with researchers, I've been getting more questions lately about how to extract text with time stamps from videos and images. This makes sense, given how much public video is now posted online.

YouTube often provides a text transcript (captions) along with time stamps. You can access this transcript without any programming through the YouTube website. Here are instructions:

https://ccm.net/faq/40644-how-to-get-the-transcript-of-a-youtube-video

If you have a large number of videos, you may want to avoid a lot of manual downloading and formatting and use a Python script instead. Fortunately, an open source module, "youtube_transcript_api", provides an easy API for this task.

https://pypi.org/project/youtube-transcript-api/


#### Note - Getting text data from my own videos

If you have your own videos and don't mind making them public or unlisted, you can upload them to YouTube and use the methods shown here. If you have a very large dataset, you might want to use a cloud storage and API solution instead. Various platforms provide this - here's a link to the Google Cloud Video Intelligence API.

https://cloud.google.com/video-intelligence/docs/text-detection

This takes a little more programming and configuration, and may result in some cloud computing charges depending on the amount of data you want to process, but it is probably more scalable and can offer a more secure environment for videos you want to keep private.

### Sample Data

This workbook reads the text from a series of YouTube videos, formats it as a pandas dataframe, and queries it by timestamp and text string.

For illustration, we'll use a series of lectures from "On Power and Politics in Today's World"

https://www.youtube.com/playlist?list=PLh9mgdi4rNeyViG2ar68jkgEi4y6doNZy

### Install and import the youtube_transcript_api

You'll need to install the module before you can import it. You only have to do this once on your system (even if you use it in a different notebook or python script), so you may want to comment out or remove this line after running it once. 

In [1]:
#!pip install youtube_transcript_api

In [2]:
from youtube_transcript_api import YouTubeTranscriptApi
import pandas as pd

In [3]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.options.display.max_colwidth = None

### Extract the transcript from one video

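We'll extract the transcript for each video using the YouTubeTranscriptApi get_transcript() method. This method takes the video ID as a parameter. Newer releases of the library (1.0 and later) moved to an instance-based interface, so if this call fails on your system, check the project page for the current equivalent.

You can get the video ID from the URL on YouTube - for example, https://www.youtube.com/watch?v=BDqvzFY72mg has the ID 'BDqvzFY72mg'.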
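If you are starting from a list of full URLs rather than IDs, the ID can be pulled out with the standard library. This is a minimal sketch that assumes the standard watch URL form, where the ID is the `v` query parameter; the helper name `video_id_from_url` is my own and is not part of youtube_transcript_api.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url):
    """Return the 'v' query parameter of a watch URL, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


video_id = video_id_from_url("https://www.youtube.com/watch?v=BDqvzFY72mg")
print(video_id)
```

Other URL forms (shortened or embedded links) store the ID elsewhere, so this helper returns None for them rather than guessing.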

In [4]:
transcript = YouTubeTranscriptApi.get_transcript('BDqvzFY72mg')

The method returns the transcript of the video as a list of lines, each stored as a dictionary. We get 912 lines from the video above.

In [5]:
print(type(transcript))
print(len(transcript))

<class 'list'>
912


Let's look at the first few lines. Each line is a dictionary with the keys "text", "start", and "duration".

In [6]:
transcript[:3]

[{'text': '- Hello everybody and welcome.', 'start': 8.06, 'duration': 2.66},
 {'text': 'How is everybody today?', 'start': 10.72, 'duration': 1.613},
 {'text': 'Great.', 'start': 13.404, 'duration': 0.916}]

You can parse the transcript using standard techniques for JSON or dictionaries (more info here: https://github.com/geoffswc/Python-JSON-Workshop).
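As a small illustration of working with the list of dictionaries directly, the sketch below prints each caption with a minutes:seconds prefix. It uses the three lines shown above as sample data; the helper `format_timestamp` is a hypothetical name introduced only for this example.

```python
# The first three transcript entries shown above, copied in as sample data
sample = [
    {'text': '- Hello everybody and welcome.', 'start': 8.06, 'duration': 2.66},
    {'text': 'How is everybody today?', 'start': 10.72, 'duration': 1.613},
    {'text': 'Great.', 'start': 13.404, 'duration': 0.916},
]


def format_timestamp(seconds):
    """Format a number of seconds as MM:SS, dropping the fractional part."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Each entry is a plain dictionary, so ordinary key lookups are all we need
lines = [f"[{format_timestamp(entry['start'])}] {entry['text']}" for entry in sample]
for line in lines:
    print(line)
```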

### Transcripts for Multiple Videos in Pandas Format

Fortunately, this is a flat dictionary structure, not deeply nested, so we can convert this to a pandas dataframe easily. In this next section, we'll review code to convert a series of videos and concatenate them into a single data frame.
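To see why the flat structure matters, here is a minimal sketch of the conversion on the three sample lines from the previous section. Each dictionary key becomes a column, and we add the video ID ourselves because the transcript entries do not carry it.

```python
import pandas as pd

# Sample entries copied from the transcript output shown earlier
sample = [
    {'text': '- Hello everybody and welcome.', 'start': 8.06, 'duration': 2.66},
    {'text': 'How is everybody today?', 'start': 10.72, 'duration': 1.613},
    {'text': 'Great.', 'start': 13.404, 'duration': 0.916},
]

# A list of flat dictionaries maps directly onto rows and columns
df_sample = pd.DataFrame(sample)

# Tag every row so the source video is still known after frames are combined
df_sample['video_id'] = 'BDqvzFY72mg'

print(df_sample.columns.tolist())
```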

First, we'll create a list of IDs for each video.

In [7]:
links = [
    'BDqvzFY72mg',
    'f5nbT4xQqwI',
    's48b9B5gd88',
    '4eUS8trd_yI',
    'aKW_Vsk4hzs',
    'q53DF6ySOZg',
    'T3-VlQu3iRM'
]

Extract the transcript for each video using the YouTubeTranscriptApi get_transcript() method.

For now, we will store the transcript for each video as a dataframe in a list named transcripts.

In [8]:
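transcripts = []
for v in links:
    try:
        # Fetch one transcript and tag each row with its video ID
        df = pd.DataFrame(YouTubeTranscriptApi.get_transcript(v))
        df['video_id'] = v
        transcripts.append(df)
    except Exception:
        # Videos without an available transcript are reported and skipped
        print(v, 'failed to translate')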

aKW_Vsk4hzs failed to translate
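When a video fails, it can help to keep the reason rather than only printing the ID. The sketch below is one way to do that, under some assumptions: `collect_transcripts` and `fake_fetch` are hypothetical names of my own, and the `fetch` argument stands in for whatever function returns a list of caption dictionaries (for example, the get_transcript() call used above). A fake fetcher is used here so the example runs without a network connection.

```python
import pandas as pd


def collect_transcripts(video_ids, fetch):
    """Return (list of dataframes, dict mapping failed video IDs to the error)."""
    frames, failures = [], {}
    for video_id in video_ids:
        try:
            frame = pd.DataFrame(fetch(video_id))
        except Exception as err:
            # Record why the download failed and move on to the next video
            failures[video_id] = repr(err)
            continue
        frame['video_id'] = video_id
        frames.append(frame)
    return frames, failures


def fake_fetch(video_id):
    """Stand-in for a real transcript download, used only for this demo."""
    if video_id == 'missing':
        raise LookupError('no transcript available')
    return [{'text': 'Great.', 'start': 13.404, 'duration': 0.916}]


frames, failures = collect_transcripts(['ok_video', 'missing'], fake_fetch)
print(len(frames), failures)
```

Keeping the failures in a dictionary makes it easy to retry just those videos later or to report them alongside your results.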


Note that we have now created a list of pandas dataframes. Let's take a look at a few lines from the first one.

In [9]:
transcripts[0].head()

Unnamed: 0,text,start,duration,video_id
0,- Hello everybody and welcome.,8.06,2.66,BDqvzFY72mg
1,How is everybody today?,10.72,1.613,BDqvzFY72mg
2,Great.,13.404,0.916,BDqvzFY72mg
3,"Well, I'm delighted to\nhave the opportunity",14.32,3.54,BDqvzFY72mg
4,to be giving the DeVane Lectures.,17.86,2.92,BDqvzFY72mg


Next, we'll combine all the dataframes into a single dataframe.

In [10]:
df_transcripts = pd.concat(transcripts).reset_index(drop=True)
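The reset_index(drop=True) step matters because each per-video dataframe starts its row numbering at 0. Here is a small sketch with two made-up frames (the names and values are illustrative only) showing the difference.

```python
import pandas as pd

# Two tiny stand-in frames; each one has its own index starting at 0
a = pd.DataFrame({'text': ['one', 'two'], 'video_id': 'video_a'})
b = pd.DataFrame({'text': ['three'], 'video_id': 'video_b'})

# Without resetting, the original row labels are kept and repeat
stacked = pd.concat([a, b])

# With reset_index(drop=True), rows are renumbered and the old labels discarded
renumbered = pd.concat([a, b]).reset_index(drop=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```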

In [11]:
#df_transcripts.iloc[1000:1100]

### Run some queries

Now that we have our text in a single dataframe, we can analyze it using a wide range of tools in Python. You might be interested in natural language processing, sentiment analysis, text classification, lexical structures, regional differences in language used in school board meetings... more than we can get into here (though feel free to get started with the Library "Document Classification with Scikit-Learn" workshop at https://courses.ucsf.edu/course/view.php?id=8249).

For now, we'll just query the data in a few ways and leave it there. If you've taken any of my workshops, you'll know I lean toward using SQL, so I'll write a few queries using the pandasql module.

In [12]:
# !pip install pandasql
from pandasql import sqldf 
pysqldf = lambda q: sqldf(q, globals())

In [13]:
# how many lines of text does each video have?
pysqldf("SELECT video_id, COUNT(*) FROM df_transcripts GROUP BY video_id")

Unnamed: 0,video_id,COUNT(*)
0,4eUS8trd_yI,1300
1,BDqvzFY72mg,912
2,T3-VlQu3iRM,1280
3,f5nbT4xQqwI,1180
4,q53DF6ySOZg,1380
5,s48b9B5gd88,1286


In [14]:
# approximate length of each video in seconds (highest start + duration)
pysqldf("SELECT video_id, MAX(start + duration) FROM df_transcripts GROUP BY video_id")

Unnamed: 0,video_id,MAX(start + duration)
0,4eUS8trd_yI,4325.166
1,BDqvzFY72mg,3367.023
2,T3-VlQu3iRM,4234.644
3,f5nbT4xQqwI,4235.705
4,q53DF6ySOZg,4399.389
5,s48b9B5gd88,4423.037
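If you prefer to stay in pandas, the two aggregate queries above have direct groupby equivalents. This is a sketch on a small made-up dataframe (the `toy` data and `vid_1`/`vid_2` IDs are invented for illustration), with the results sorted so the largest values come first.

```python
import pandas as pd

# Invented sample data with the same columns as df_transcripts
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e'],
    'start': [0.0, 2.0, 5.0, 0.0, 3.0],
    'duration': [2.0, 3.0, 1.5, 3.0, 4.0],
    'video_id': ['vid_1', 'vid_1', 'vid_1', 'vid_2', 'vid_2'],
})

# Equivalent of SELECT video_id, COUNT(*) ... GROUP BY video_id
line_counts = toy.groupby('video_id').size().sort_values(ascending=False)

# Equivalent of SELECT video_id, MAX(start + duration) ... GROUP BY video_id
end_times = (
    toy.assign(end=toy['start'] + toy['duration'])
       .groupby('video_id')['end']
       .max()
       .sort_values(ascending=False)
)

print(line_counts)
print(end_times)
```

The same ordering can be had in SQL by adding an ORDER BY clause to the queries.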


In [15]:
# number of lines mentioning the Cold War in each video
pysqldf("""
SELECT 
    video_id, 
    COUNT(1) 
FROM 
    df_transcripts 
WHERE 
    LOWER(text) LIKE ('%cold war%')
GROUP BY 
    video_id""")

Unnamed: 0,video_id,COUNT(1)
0,BDqvzFY72mg,8
1,T3-VlQu3iRM,1
2,f5nbT4xQqwI,2
3,q53DF6ySOZg,4
4,s48b9B5gd88,19
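The LOWER(text) LIKE '%cold war%' filter is a case-insensitive substring match, which pandas can also express with str.contains. The sketch below uses a few rows in the style of the transcript data; treat it as an alternative illustration, not a replacement for the SQL approach.

```python
import pandas as pd

# A handful of caption lines, three of which mention the Cold War
rows = pd.DataFrame({
    'text': [
        'because of the Cold War.',
        'who are both cold war historians',
        'Great.',
        'We had arms races all\nthrough the Cold War,',
    ],
    'video_id': ['BDqvzFY72mg', 'BDqvzFY72mg', 'BDqvzFY72mg', 'f5nbT4xQqwI'],
})

# case=False ignores capitalization; regex=False treats the phrase literally
mask = rows['text'].str.contains('cold war', case=False, regex=False)

# Count matching lines per video, like the GROUP BY query
mentions = rows[mask].groupby('video_id').size()
print(mentions)
```

Note that either approach counts caption lines, not occurrences, and a phrase that is split across a line break inside a caption will not match.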


In [16]:
# what rows matched
pysqldf("""
SELECT 
    *
FROM 
    df_transcripts 
WHERE 
    LOWER(text) LIKE ('%cold war%')
""")

Unnamed: 0,text,start,duration,video_id
0,because of the Cold War.,113.46,1.49,BDqvzFY72mg
1,most of the conflicts within the Cold War,123.26,3.46,BDqvzFY72mg
2,at the end of the Cold War\nit was a defensive alliance,1828.73,2.37,BDqvzFY72mg
3,that came about as a\nbyproduct of the Cold War,2001.73,0.833,BDqvzFY72mg
4,since the Cold War.,3106.63,1.74,BDqvzFY72mg
5,who are both cold war historians,3120.3,2.18,BDqvzFY72mg
6,who are gonna be teaching\ncourses on the Cold War.,3122.48,3.24,BDqvzFY72mg
7,"from into war Europe II to\nthe end of the Cold War,",3130.6,5.0,BDqvzFY72mg
8,"We had arms races all\nthrough the Cold War,",1296.08,2.73,f5nbT4xQqwI
9,"but we're not, this is not\na course about the Cold War.",1582.25,3.543,f5nbT4xQqwI


In [17]:
# when was cold war first mentioned in each video?
pysqldf("""
SELECT 
    video_id, 
    MIN(start) 
FROM 
    df_transcripts 
WHERE 
    LOWER(text) LIKE ('%cold war%')
GROUP BY 
    video_id""")

Unnamed: 0,video_id,MIN(start)
0,BDqvzFY72mg,113.46
1,T3-VlQu3iRM,4213.11
2,f5nbT4xQqwI,1296.08
3,q53DF6ySOZg,952.98
4,s48b9B5gd88,12.91
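A start time becomes more useful when you can jump straight to it. The sketch below turns two of the first-mention timestamps above into links, assuming YouTube's `t` URL parameter (a start offset in seconds), which is not part of the original workbook; `timestamp_link` is a hypothetical helper name.

```python
# First-mention timestamps copied from the query result above
first_mentions = {
    'BDqvzFY72mg': 113.46,
    's48b9B5gd88': 12.91,
}


def timestamp_link(video_id, start_seconds):
    """Build a watch URL that starts playback at the given offset."""
    # int() drops the fractional second, so playback begins just before the caption
    return f"https://www.youtube.com/watch?v={video_id}&t={int(start_seconds)}s"


links_to_mentions = {
    vid: timestamp_link(vid, start) for vid, start in first_mentions.items()
}
for url in links_to_mentions.values():
    print(url)
```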


In [18]:
#aKW_Vsk4hzs
transcript = pd.DataFrame(YouTubeTranscriptApi.get_transcript('q53DF6ySOZg'))

In [19]:
transcript

Unnamed: 0,text,start,duration
0,"- Okay, so today,",7.13,1.14
1,we're talking about the\nresurgent right in the West.,8.27,3.473
2,Let's get fix our intuitions,12.59,1.99
3,with a little walk down memory lane.,14.58,4.753
4,"- [Ronald Reagan] In Chicago,",21.29,0.87
5,they found a woman who holds the record.,22.16,2.39
6,"She used 80 names, 30\naddresses, 15 telephone numbers",24.55,4.326
7,"to collect food stamps, social security,",28.876,2.941
8,veteran's benefits for four non-existent,31.817,2.943
9,deceased veterans husbands\nas well as welfare.,34.76,3.57


In [20]:
' '.join(transcript['text'])



In [21]:
df_transcripts = df_transcripts.replace(r'\n',' ', regex=True)
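The replace call above swaps the line breaks inside captions for spaces. Here is a minimal sketch of the same call on a one-row dataframe, using a caption from the first video, to show that only the text changes and numeric columns are left alone.

```python
import pandas as pd

# One caption that contains an embedded line break
demo = pd.DataFrame({
    'text': ["Well, I'm delighted to\nhave the opportunity"],
    'start': [14.32],
})

# With regex=True, the pattern r'\n' matches newline characters in string cells
cleaned = demo.replace(r'\n', ' ', regex=True)

print(cleaned['text'][0])
```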

In [22]:
df_transcripts.head(20)

Unnamed: 0,text,start,duration,video_id
0,- Hello everybody and welcome.,8.06,2.66,BDqvzFY72mg
1,How is everybody today?,10.72,1.613,BDqvzFY72mg
2,Great.,13.404,0.916,BDqvzFY72mg
3,"Well, I'm delighted to have the opportunity",14.32,3.54,BDqvzFY72mg
4,to be giving the DeVane Lectures.,17.86,2.92,BDqvzFY72mg
5,"And the DeVane Lectures, as you can tell,",20.78,2.86,BDqvzFY72mg
6,from looking around you double as being,23.64,3.83,BDqvzFY72mg
7,a regular Yale course for credit,27.47,3.31,BDqvzFY72mg
8,that students can take for credit,30.78,1.92,BDqvzFY72mg
9,and lectures that are open to the general public.,32.7,3.673,BDqvzFY72mg


In [23]:
df_transcripts.to_csv('transcripts.csv', index=False)
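To check that a saved file reads back the way you expect, you can write and reload a small frame. This sketch uses a temporary directory and a made-up file name so it does not touch transcripts.csv. Passing index=False keeps the row numbers out of the file, which avoids an extra "Unnamed: 0" column on reload.

```python
import os
import tempfile

import pandas as pd

# A two-row stand-in for df_transcripts
demo = pd.DataFrame({
    'text': ['Great.', 'How is everybody today?'],
    'start': [13.404, 10.72],
    'duration': [0.916, 1.613],
    'video_id': ['BDqvzFY72mg', 'BDqvzFY72mg'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'transcripts_demo.csv')
    demo.to_csv(path, index=False)   # write without the index column
    restored = pd.read_csv(path)     # read it back while the file still exists

print(restored.columns.tolist())
```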