## Corpus collection code

### Instructions for creating the entire corpus

1. Import the required libraries and packages.
2. Define the functions.
3. Define the variables and seeds.
    - Change the `languages` variable to collect parallel transcripts in the languages you are interested in. By default, the code collects parallel transcripts in English, Korean, and Chinese (simplified).
    - Increase `NUM_DOCUMENTS` to collect more documents (the default is 10).
    - The current seed list contains two TED talks.
4. Run the cells to scrape transcripts until you have collected a Brown-size (or desired-size) corpus.
5. All transcripts are saved in the `/transcripts/{language}/` folders.
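
The scraping code below writes to `../transcripts/{language}/ted_talks_{language}.json` and does not create the folders itself, so opening a file in a missing folder would fail. A minimal sketch for creating the folders up front follows; the helper name `ensure_transcript_dirs` and its `base_dir` parameter are hypothetical and not part of the original notebook.

```python
import os

def ensure_transcript_dirs(languages, base_dir="../transcripts"):
    """Create one output folder per language code and return the folder paths."""
    paths = []
    for language in languages:
        path = os.path.join(base_dir, language)
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
        paths.append(path)
    return paths
```

Calling `ensure_transcript_dirs(['en', 'ko', 'zh-cn'])` once before scraping should be enough for the default languages.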

### Step-by-step algorithm of the scraping code

1. Check the existing English output file and update the `done_list`.
2. Get `todo_talks` from `todo_list_file`.
3. While there are entries in `todo_talks` and `n` < `NUM_DOCUMENTS`:
    - 3.1. Pop the current talk from `todo_talks`.
    - 3.2. Check whether the current talk has transcripts in all desired languages (default: en, ko, zh-cn). If not, skip it.
    - 3.3. While scraping has not succeeded and fewer than 10 attempts have been made:
        - 3.3.1. Try to scrape the TED transcript for each language, and get the list of talks related to the current talk. If this succeeds, set `success = True`.
        - 3.3.2. If there is an error, sleep for one second and try again.
    - 3.4. For each talk in the related talks, if it is not in the `done_list`, add it to `todo_talks`.
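
The control flow of these steps can be sketched without any network access. In the sketch below, `crawl` and its `fetch` callback are hypothetical stand-ins for the scraping code (the real code downloads and saves transcripts inside the loop), the toy `graph` is made up, and the retry delay defaults to zero only to keep the example fast.

```python
import time

def crawl(seeds, fetch, max_documents, max_attempts=10, delay=0.0):
    """Pop a talk, fetch it with retries, then queue its related talks."""
    todo = set(seeds)
    done = set()
    n = 0
    while todo and n < max_documents:
        n += 1
        current = todo.pop()
        related = None
        for _ in range(max_attempts):
            try:
                related = fetch(current)  # returns the names of related talks
                break
            except Exception:
                time.sleep(delay)         # wait before the next attempt
        if related is None:
            continue                      # every attempt failed; skip this talk
        done.add(current)
        todo.update(talk for talk in related if talk not in done)
    return done, todo

# toy "related talks" graph standing in for the TED website
graph = {"a": ["b", "c"], "b": ["a"], "c": []}
done, todo = crawl({"a"}, lambda talk: graph[talk], max_documents=10)
```

With this toy graph, every talk reachable from the seed ends up in `done` and `todo` is left empty.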

### Step-by-step guide to reading the parallel transcripts in three languages

1. Load the corpus with the `.load()` method.
2. Get the parallel transcripts from the `.transcrpits` attribute (spelled this way in the class definition).


## Code for building the TED talks corpus

### Imports
- These libraries/packages are required to run all cells in this document.

In [14]:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import time
import json
import os
import urllib.request as urllib2
from nltk import pos_tag, sent_tokenize, word_tokenize
import csv
from collections import defaultdict

### Start Code

### Define functions for the `TED_Talks` class
- `get_profile`: Extract the transcript from the soup object of a TED talk page
- `get_talks`: Get the names of more talks from the recommendation list on a TED talk page
- `get_talker`: Return the speaker's name from a TED talk page
- `get_views`: Return the number of views from a TED talk page
- `get_date`: Return the upload date from a TED talk page
- `update_csv`: Append the talk's metadata to the CSV file
- `write_json`: Append the talk dict scraped from a TED talk page to the JSON file

In [2]:
def get_profile(soup):
    """Return transcript from soup of Ted Talk website."""
    transcript = ""
    for node in soup.find_all("div", {"class" : "Grid__cell flx-s:1 p-r:4"}):
        text = node.get_text().strip().replace('\t\t\t\t\t\t\t\t\t\t\t', ' ')
        if text:
            transcript = transcript + '\n' + "[PARAGRAPH]" + '\n' + text
    return transcript
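
`get_profile` joins the paragraphs of a talk with a `[PARAGRAPH]` marker on its own line. A minimal sketch of the reverse step, turning a scraped transcript back into a list of paragraphs, is shown here; the helper `split_paragraphs` is hypothetical and the sample string is a shortened excerpt in the same shape as the scraped text.

```python
def split_paragraphs(transcript, marker="[PARAGRAPH]"):
    """Split a scraped transcript into paragraphs with line breaks collapsed."""
    paragraphs = []
    for chunk in transcript.split(marker):
        text = " ".join(chunk.split())  # collapse newlines and repeated spaces
        if text:
            paragraphs.append(text)
    return paragraphs

sample = (
    "\n[PARAGRAPH]\nIt's 4 a.m., \nand the big test is in eight hours,"
    "\n[PARAGRAPH]\nSleep occupies nearly \na third of our lives,"
)
paragraphs = split_paragraphs(sample)
```

For the sample above, `paragraphs` holds two strings, one per `[PARAGRAPH]` marker.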

In [3]:
def get_talks(soup):
    """Return the names of more talks from the recommendation list on a TED talk page."""
    talks = []
    # talk names appear in the page source as "slug":"some_talk_name"
    regex = r"slug\":\"((\w+_)+\w+)\""
    for match in re.finditer(regex, str(soup)):
        talks.append(match.group(1))
    return talks[3:-2]
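
To make the regular expression concrete, the sketch below applies the same pattern to a made-up fragment of page source. The helper `find_slugs` and the sample string are illustrative assumptions, and the `[3:-2]` slice used by `get_talks` is left out so that every match is visible.

```python
import re

# same pattern as in get_talks: a name made of word characters joined by underscores
SLUG_PATTERN = r"slug\":\"((\w+_)+\w+)\""

def find_slugs(page_source):
    """Return every talk name that follows a "slug": key in the page source."""
    return [match.group(1) for match in re.finditer(SLUG_PATTERN, page_source)]

sample = '{"slug":"anita_collins_how_playing","id":1},{"slug":"single"},{"slug":"a_b"}'
slugs = find_slugs(sample)
```

The pattern requires at least one underscore, so `"single"` is skipped and `slugs` contains only the two underscore-separated names.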

In [4]:
def get_talker(soup):
    """Return the speaker's name from the soup of a TED talk page."""
    for node in soup.find_all("meta", {"name" : "author"}):
        name = node['content']

    return name

In [5]:
def get_views(soup):
    """Return the number of views from the soup of a TED talk page."""
    for node in soup.find_all("span", {"classname" : "f-w:400"}):
        views = node.parent.text.strip().split('\n')[0]

    return views

In [6]:
def get_date(soup):
    """Return the upload date (YYYY-MM-DD) from the soup of a TED talk page."""
    for node in soup.find_all("meta", {"itemprop" : "uploadDate"}):
        date = node['content'][:10]

    return date

In [7]:
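def update_csv(talk_dict):
    """Append the talk's metadata (everything except the text) as one row of the csv file."""
    # note: this removes 'text' from the caller's dict, so call it after write_json
    talk_dict.pop('text', None)
    with open('../src/ted_talks.csv', 'a', newline='') as csvfile:
        fieldnames = talk_dict.keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
#         writer.writeheader()  # uncomment to write the header row when starting a new file
        writer.writerow(talk_dict)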
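
Because the header line in `update_csv` is commented out, a freshly created CSV file has no column names. One possible alternative, sketched here under the assumption that a header is wanted, writes the header only when the file is new or empty; `append_row` is a hypothetical helper, not part of the original notebook.

```python
import csv
import os

def append_row(csv_path, row):
    """Append one dict as a CSV row, writing the header only for a new or empty file."""
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=list(row.keys()))
        if is_new:
            writer.writeheader()
        writer.writerow(row)
```

This sketch assumes every row has the same keys in the same order, as the `talk_dict` built in the scraping loop does.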


In [8]:
def write_json(talk_dict, file_path='../transcripts/en/ted_talks_en.json'):
    """Append the talk dict scraped from a TED talk page to the json file."""
    if os.path.exists(file_path):
        with open(file_path, 'r', encoding='utf-8') as json_file:
            temp = json.load(json_file)
            # skip the talk if it was the last one written (e.g. after a retry)
            if not temp or temp[-1]['title'] != talk_dict['title']:
                temp.append(talk_dict)
        with open(file_path, 'w', encoding='utf-8') as json_file:
            json.dump(temp, json_file,  indent=4, separators=(',', ':'))

    else:
        # first talk for this language: start a new list
        with open(file_path, 'w', encoding='utf-8') as json_file:
            json.dump([talk_dict], json_file,  indent=4, separators=(',', ':'))
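
By default `json.dump` escapes non-ASCII characters, so the Korean and Chinese files written by `write_json` contain `\uXXXX` sequences. They load back to the same text, but if human-readable files are preferred, `ensure_ascii=False` is one option. The record below is a made-up example.

```python
import json

record = {"title": "demo_talk", "language": "ko", "text": "안녕하세요"}

escaped = json.dumps(record)                       # default: non-ASCII is escaped
readable = json.dumps(record, ensure_ascii=False)  # keeps the original characters

# both forms decode to the same dict
same = json.loads(escaped) == json.loads(readable) == record
```

The same keyword argument is accepted by `json.dump`, provided the file is opened with `encoding='utf-8'` as `write_json` already does.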

### Create the Corpus Class

#### NOTE
**If you want to scrape from scratch, make sure to delete the `todo.txt` file in the `/src/` folder and all transcripts under the `/transcripts/` folder.**

#### Define the TED talks corpus class

In [80]:
class TED_Talks():
    
    def __init__(self, languages = ['en', 'ko', 'zh-cn']):
        self.languages = languages
        self.todo_list_file = "todo.txt"
        self.done_list = set()
        self.transcrpits = []
        
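    def scrape(self, NUM_DOCUMENTS, seed_list = ()):
        """Scrape TED talk transcripts, starting from the talks in the seed list."""
        
        # copy the seeds into a set so that .pop() and .add() both work below
        todo_talks = set(seed_list)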
        
        # check the existing output file of the first language (English by default) and update the done_list
        file_path = f"../transcripts/{self.languages[0]}/ted_talks_{self.languages[0]}.json"
        if os.path.exists(file_path):
            with open(file_path, encoding="utf-8") as f:
                for talk in json.load(f):
                    self.done_list.add(talk['title'])

        # get todo_talks from the saved to-do list, if there is one
        if os.path.exists(self.todo_list_file):
            with open(self.todo_list_file, encoding="utf-8") as f:
                todo_talks = set(json.load(f))
            
            
        n = 0
        while todo_talks and n < NUM_DOCUMENTS:
            n+=1
            current_talk = todo_talks.pop()
            print(current_talk)
            print(len(todo_talks))
            print(len(self.done_list))
            success = False
            count = 0

            # check whether the talk has transcripts in all languages
            exist_websites = 1 
            for language in self.languages:
                try: # the request raises HTTPError if the page does not exist
                    page = urlopen(f'https://www.ted.com/talks/{current_talk}/transcript?language={language}')
                except urllib2.HTTPError as e:
                    if e.getcode() == 404: # check the return code
                        exist_websites = 0 # a transcript is missing in some language
                time.sleep(1)

            if exist_websites == 0:
                print("no target languages")
                continue

            # scrape the transcript and update the files.
            while not success and count < 10:
                try:
                    
                    talk_link = f'https://www.ted.com/talks/{current_talk}/transcript'
                    name = get_talker(BeautifulSoup(urlopen(talk_link), 'html.parser'))
                    
                    for language in self.languages:
                        # get meta data of the talk
                        file_path = f"../transcripts/{language}/ted_talks_{language}.json"
                        url = f"{talk_link}?language={language}"
                        ted_soup = BeautifulSoup(urlopen(url), 'html.parser')
                        
                        # get meta data
                        name = get_talker(ted_soup)
                        text = get_profile(ted_soup)
                        text_length = len(text)
                        talks = get_talks(ted_soup)
                        views = get_views(ted_soup)
                        date = get_date(ted_soup)
                        talk_dict = {'title':current_talk, 'talker':name, 'text_length':text_length, 'views':views, 'language':language, 'text':text, 'url':url}
#                         print(talk_dict['talker'])
                        write_json(talk_dict, file_path)
                        update_csv(talk_dict)

                        time.sleep(1)
                        
                    self.done_list.add(current_talk)    
                     
                    success = True
                except Exception:  # retry on any error, but let a keyboard interrupt stop the run
                    print(count, " fail!")
                    count += 1
                    time.sleep(1)
            if count == 10:
                continue
            
            # update the todo list.
            for talk in talks:
                if talk not in self.done_list:
                    todo_talks.add(talk)
            with open(self.todo_list_file, "w", encoding="utf-8") as fout:
                json.dump(list(todo_talks), fout)
            self.done_list.add(current_talk)
            time.sleep(1)
            
    def load(self):
        """Load the scraped json files and collect the talks available in all three languages."""
        
        languages = ['en', 'ko', 'zh-cn']
        files = {}
        title2ind = {}
        for language in languages:
            file_path = f"../transcripts/{language}/ted_talks_{language}.json"
            with open(file_path, encoding="utf-8") as f:
                files[language] = json.load(f)
            # map each title to its index in that language's file
            title2ind[language] = {talk['title']: i for i, talk in enumerate(files[language])}

        share_titles = set(title2ind['en']) & set(title2ind['ko']) & set(title2ind['zh-cn'])

        for title in share_titles:
            # one entry per talk: [English text, Korean text, Chinese text]
            text_lst = [files[language][title2ind[language][title]]['text'] for language in languages]
            self.transcrpits.append(text_lst)
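
The alignment step in `load` keeps only the titles that occur in every language and stores the texts in a fixed language order. The same idea can be sketched for any list of languages; `align_by_title` and the sample records are hypothetical, and the sketch assumes at least one language and records shaped like the scraped talk dicts (with `title` and `text` keys).

```python
def align_by_title(records_by_language, languages):
    """Return {title: [text per language]} for titles present in every language."""
    index = {
        language: {rec["title"]: rec["text"] for rec in records_by_language[language]}
        for language in languages
    }
    shared = set.intersection(*(set(index[language]) for language in languages))
    return {
        title: [index[language][title] for language in languages]
        for title in sorted(shared)
    }

records = {
    "en": [{"title": "talk_a", "text": "hello"}, {"title": "talk_b", "text": "bye"}],
    "ko": [{"title": "talk_a", "text": "안녕하세요"}],
    "zh-cn": [{"title": "talk_a", "text": "你好"}],
}
parallel = align_by_title(records, ["en", "ko", "zh-cn"])
```

Here `talk_b` is dropped because it exists only in English, which mirrors how `load` discards talks without a full set of translations.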

#### Create an instance of the `TED_Talks` class

In [69]:
ted_talks = TED_Talks()

#### Load the scraped json files

In [78]:
ted_talks.load()

#### Scrape the TED talks transcripts
- Change the value of `NUM_DOCUMENTS` to set how many documents to scrape.

In [76]:
# variables
NUM_DOCUMENTS = 10
languages = ['en', 'ko', 'zh-cn']
todo_list_file = "todo.txt"

# set seeds
seed_list = set(["elizabeth_hellmuth_margulis_why_we_love_repetition_in_music", "anita_collins_how_playing_an_instrument_benefits_your_brain"])

# uncomment below to start scraping; the seeds are used only when no todo.txt exists yet
# ted_talks.scrape(NUM_DOCUMENTS, seed_list)

#### To Stop and To Continue Scraping

The **to-do list** is saved as `todo.txt` in the `src` folder after each successfully scraped talk, and the transcripts are saved in the `transcripts` folder as they are scraped. This state is therefore kept both when the run reaches the number of documents set in `NUM_DOCUMENTS` and when it stops because of an unexpected error such as a kernel interruption, computer shutdown, or network issue.

If you run the scraping cell above again, it continues with the talks saved in `todo.txt`. New transcripts are appended to the existing transcript files, and the `done_list` is rebuilt from the existing English output file.
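
The resume behaviour comes down to one rule: a saved to-do list takes priority over the seeds. A minimal sketch of that rule, with a hypothetical helper `load_todo` that mirrors the start of `scrape`:

```python
import json
import os

def load_todo(todo_path, seeds):
    """Return the saved to-do set if the file exists, otherwise the seed talks."""
    if os.path.exists(todo_path):
        with open(todo_path, encoding="utf-8") as f:
            return set(json.load(f))
    return set(seeds)
```

This is also why deleting `todo.txt` is needed to restart from the seeds.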

#### Get the parallel transcripts from the corpus

In [73]:
ted_talks.transcrpits[5]

['\n[PARAGRAPH]\nIt\'s 4 a.m., \nand the big test is in eight hours,\n followed by a piano recital.\n You\'ve been studying and playing for days,\nbut you still don\'t feel ready for either.\n So, what can you do?\n Well, you can drink another cup of coffee\n and spend the next few hours \ncramming and practicing,\n but believe it or not,\n you might be better off closing the books,\nputting away the music,\n and going to sleep.\n[PARAGRAPH]\nSleep occupies nearly \na third of our lives,\n but many of us give surprisingly\nlittle attention and care to it.\n This neglect is often the result \nof a major misunderstanding.\n Sleep isn\'t lost time,\n or just a way to rest \nwhen all our important work is done.\n Instead, it\'s a critical function,\n during which your body balances\nand regulates its vital systems,\n affecting respiration\n and regulating everything from circulation\nto growth and immune response.\n[PARAGRAPH]\nThat\'s great, but you can worry about \nall those things afte