# TEDx Transcript Scraper

This scraper downloads the transcripts of the talks listed in `tedx_dataset.csv`.

The main dataset has the following attributes:
- unique id
- details
- posted
- main_speaker
- event
- title
- num_views
- url

The transcript dataset has the following attributes:
- unique id
- timestamp
- sentence

The notebook is organized into the following sections:

- Set up the environment (install libraries, set up variables and credentials, ...)
- Parse the DOM of the web pages and download the transcript of each TEDx talk
- Store the data in CSV files

## Setup of the env

Install and import the Python libraries.

In [None]:
!pip3 install selenium
!pip3 install pandas

In [None]:
import requests
import pprint
import pandas as pd
import time
from selenium import webdriver as wd
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import selenium
import json

This notebook uses ChromeDriver to simulate user interaction with the TED website.
To set up ChromeDriver on your machine, refer to https://chromedriver.chromium.org/downloads


In [None]:
# machine-specific path to the ChromeDriver executable; change it to match your setup
chromedriver_path = '/Users/feder/Downloads/chromedriver/chromedriver'
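
Because the ChromeDriver path is specific to one machine, it can help to check it before launching the browser. The following is a minimal sketch, not part of the original notebook: `resolve_chromedriver` is a hypothetical helper that returns the configured path if it points to an existing file and otherwise falls back to looking up `chromedriver` on the system `PATH`.

```python
import os
import shutil

def resolve_chromedriver(configured_path, name="chromedriver"):
    """Return a usable ChromeDriver path, or None if none is found.

    Hypothetical helper: tries the configured path first, then the PATH.
    """
    if configured_path and os.path.isfile(configured_path):
        return configured_path
    # shutil.which returns None when the executable is not on the PATH
    return shutil.which(name)
```

A `None` result means the driver still has to be downloaded or the path corrected before `get_browser()` can work.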

In [None]:
def get_browser():
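    chrome_options = wd.ChromeOptions()
    # log-level=3 limits Chrome's console output to fatal errors
    chrome_options.add_argument('log-level=3')
    # passing the driver path positionally is the Selenium 3 style;
    # recent Selenium 4 releases expect a Service object instead
    browser = wd.Chrome(chromedriver_path, options=chrome_options)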
    return browser

browser = get_browser()

## Get TEDx transcript

The `get_transcript` function takes an entry of the talk list (a dict) as input and adds a `transcript` field to it: a list of objects, each composed of a timestamp and a sentence. If the transcript cannot be retrieved, the field is left empty.

~~~~
{'main_speaker': 'Alexandra Auer',
  'url': 'https://www.ted.com/talks/alexandra_auer_the_intangible_effects_of_walls_apr_2020',
  'id': 1,
  ...
  'transcript': [
    {'timestamp': '00:04',
     'sentence': 'Humankind loves to build walls. Have you ever noticed that?...'},
    ...
  ],
  ...
}
~~~~


In [None]:
# sentences to skip
skippables = ['(Applause)', '(Laughter)', '(Laughs)', '(Inaudible)']

# set to False to silence the per-talk progress messages
log = True

def get_transcript(my_tedx):
    if log:
        print("Current url: " + my_tedx['url'])
    
    try:
        browser.get(my_tedx['url'] + '/transcript')
        # the transcript doesn't exist
        if browser.title == "TED | 404: Not Found":
            raise Exception('Transcript not available')
        
        # waits for the video player so that all timestamps are generated
        try:
            WebDriverWait(browser, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, "video[src]")))
        except Exception:
            pass # sometimes the video player is inside an iframe and cannot be located, but the transcript still exists
        
        # ensures the English transcript is selected
        browser.find_element(By.XPATH, "//select[@name='transcript']/option[@value='en']").click()
    
        # raw string keeps the CSS escapes (\: and \@) intact
        rows = browser.find_elements(By.CSS_SELECTOR, r".Grid.Grid--with-gutter.d\:f\@md.p-b\:4")
    
        transcript = []
        for row in rows:
            sentence = row.find_elements(By.CSS_SELECTOR, "p")[0].text
            # skips non-speech annotations such as (Applause)
            if sentence.strip() not in skippables:
                timestamp = row.find_elements(By.CSS_SELECTOR, "button.sb")[0].text
                transcript.append({"timestamp": timestamp, "sentence": sentence})
        my_tedx['transcript'] = transcript
    except Exception as err:
        print(err)
        # empty list, so the field has the same type as on success
        my_tedx['transcript'] = []

    return my_tedx
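
The filtering step inside `get_transcript` can be tried without a browser. The sketch below is illustrative only: `build_transcript` is a hypothetical helper, and the `(timestamp, sentence)` pairs in `sample_rows` are made-up stand-ins for the text that Selenium would read from the page.

```python
# annotations that are not part of the speech (same list as in the notebook)
skippables = ['(Applause)', '(Laughter)', '(Laughs)', '(Inaudible)']

def build_transcript(rows, to_skip=skippables):
    """Turn (timestamp, sentence) pairs into transcript entries.

    Hypothetical helper: rows stands in for the text scraped from the page.
    """
    transcript = []
    for timestamp, sentence in rows:
        # compare the stripped text, so surrounding whitespace does not matter
        if sentence.strip() not in to_skip:
            transcript.append({"timestamp": timestamp, "sentence": sentence})
    return transcript

# made-up rows, for illustration only
sample_rows = [
    ("00:04", "Humankind loves to build walls."),
    ("00:09", " (Laughter) "),
    ("00:12", "Have you ever noticed that?"),
]
print(build_transcript(sample_rows))
```

Only rows that consist entirely of an annotation are dropped; an annotation embedded in a longer sentence is kept, because the check is an exact match on the stripped text.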

## Import data and store the new dataset in a CSV file

In [None]:
df = pd.read_csv("tedx_dataset.csv")
# converts the dataframe into a list of dicts, one per talk
my_tedx_list = df.to_dict('records')
len(my_tedx_list)

In [None]:
my_tedx_list_final = []
for my_tedx in my_tedx_list:
    my_tedx_list_final.append(get_transcript(my_tedx))
    
print("Done")

In [None]:
transcript_dataset = []
for o in my_tedx_list_final:
    for t in o['transcript']:
        transcript_dataset.append({"idx": o['idx'], "timestamp": t["timestamp"], "sentence": t["sentence"]})
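
The flattening step turns one record per talk into one row per sentence. A minimal, dependency-free sketch of the same idea follows; `flatten_transcripts` is a hypothetical name, the sample records are made up, and the standard `csv` module is used here in place of pandas only to keep the example self-contained.

```python
import csv
import io

def flatten_transcripts(talks):
    """Yield one row per transcript sentence, keyed by the talk's idx."""
    for talk in talks:
        # talks whose transcript could not be scraped have an empty transcript
        for entry in talk.get("transcript") or []:
            yield {"idx": talk["idx"],
                   "timestamp": entry["timestamp"],
                   "sentence": entry["sentence"]}

# made-up records, for illustration only
sample_talks = [
    {"idx": "a1", "transcript": [{"timestamp": "00:04", "sentence": "First sentence."},
                                 {"timestamp": "00:09", "sentence": "Second sentence."}]},
    {"idx": "b2", "transcript": []},  # scraping failed, no rows produced
]

rows = list(flatten_transcripts(sample_talks))

# write the rows as CSV to an in-memory buffer instead of a file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["idx", "timestamp", "sentence"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Talks without a transcript contribute no rows, so they appear in the main dataset but not in the transcript dataset.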

In [None]:
transcript_df = pd.DataFrame(transcript_dataset)
transcript_df.to_csv('transcript_dataset.csv', index=False)