## Possible sources of data

- Scrape TED Talks

### Scraping TED Talks

We begin with the standard imports:

In [1]:
import requests, urllib.request
import time, os, glob
import re
import pandas as pd, numpy as np
from bs4 import BeautifulSoup
from itertools import chain

Function to look at a *single* page of the TED talk listing and collect every talk link on it:

In [2]:
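def get_names(path, alltalks):
    # Download and parse one listing page
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    talks = soup.find_all("a", class_ = "")
    for i in talks:
        # Keep only links that start with /talks/ (anchors without an href are skipped);
        # using the href as a dict key de-duplicates repeated links
        href = i.get('href', '')
        if href.startswith('/talks/'):
            alltalks[href] = 1

    return alltalks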
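
To see the link-filtering step without hitting the network, here is a minimal sketch that runs the same idea on a small HTML string. The sample markup, the name `collect_talk_links`, and the use of the built-in `html.parser` (instead of `lxml`) are assumptions made for illustration; the real listing page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one listing page; the real markup may differ.
SAMPLE_HTML = """
<a href="/talks/talk_one?language=zh-tw">Talk one</a>
<a href="/talks/talk_one?language=zh-tw">Talk one (thumbnail link)</a>
<a href="/talks/talk_two?language=zh-tw">Talk two</a>
<a href="/playlists/123">A playlist</a>
<a>No href at all</a>
"""

def collect_talk_links(html, seen=None):
    """Return a dict whose keys are the unique hrefs that start with /talks/."""
    seen = {} if seen is None else seen
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", href=True):  # skip anchors without an href
        href = anchor["href"]
        if href.startswith("/talks/"):
            seen[href] = 1  # dict keys de-duplicate repeated links
    return seen

links = collect_talk_links(SAMPLE_HTML)
print(list(links))
```

The duplicate link and the playlist link are both dropped, which is why each listing page adds 36 entries rather than one per anchor.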

Function to scrape all TED talk links:

In [3]:
alltalks = {}
link = "https://www.ted.com/talks?language=zh-tw&page={}&sort=newest"

In [4]:
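def get_talks(alltalks):
    try:
        # Resume from the page implied by the links collected so far (36 talks per page).
        # Page 0 returns the same talks as page 1, so the count does not grow on the second request.
        for i in range(int(len(alltalks) / 36), 66):
            path = link.format(i)
            alltalks = get_names(path, alltalks)
            print(path, len(alltalks))
            time.sleep(4)  # pause between requests
    except urllib.request.HTTPError:
        # On an HTTP error, wait and then resume from where the collection left off
        print("TED got mad at you, waiting 30 seconds")
        time.sleep(30)
        get_talks(alltalks)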

get_talks(alltalks)

https://www.ted.com/talks?language=zh-tw&page=0&sort=newest 36
https://www.ted.com/talks?language=zh-tw&page=1&sort=newest 36
https://www.ted.com/talks?language=zh-tw&page=2&sort=newest 72
https://www.ted.com/talks?language=zh-tw&page=3&sort=newest 108
https://www.ted.com/talks?language=zh-tw&page=4&sort=newest 144
https://www.ted.com/talks?language=zh-tw&page=5&sort=newest 180
https://www.ted.com/talks?language=zh-tw&page=6&sort=newest 216
https://www.ted.com/talks?language=zh-tw&page=7&sort=newest 252
https://www.ted.com/talks?language=zh-tw&page=8&sort=newest 288
https://www.ted.com/talks?language=zh-tw&page=9&sort=newest 324
https://www.ted.com/talks?language=zh-tw&page=10&sort=newest 360
https://www.ted.com/talks?language=zh-tw&page=11&sort=newest 396
https://www.ted.com/talks?language=zh-tw&page=12&sort=newest 432
https://www.ted.com/talks?language=zh-tw&page=13&sort=newest 468
https://www.ted.com/talks?language=zh-tw&page=14&sort=newest 504
https://www.ted.com/talks?language=zh-
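
The wait-and-retry pattern used above recurses once per failure. As an alternative, here is a sketch of the same idea written as a bounded loop. The names `fetch_with_retry` and `flaky_fetch`, the `max_attempts` limit, and the example URL are hypothetical; the demo uses a fake fetcher, so no request is actually sent.

```python
import time
import urllib.error

def fetch_with_retry(fetch, path, max_attempts=5, wait_seconds=30, sleep=time.sleep):
    """Call fetch(path); on HTTPError wait and retry, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(path)
        except urllib.error.HTTPError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            sleep(wait_seconds)

# Fake fetcher that fails twice and then succeeds, to exercise the retry path.
calls = {"n": 0}

def flaky_fetch(path):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError(path, 429, "Too Many Requests", None, None)
    return "page body for " + path

waits = []  # record the requested waits instead of really sleeping
result = fetch_with_retry(flaky_fetch, "https://example.invalid/page", sleep=waits.append)
print(result, waits)
```

In the scraper itself, `fetch` could be a small wrapper around `urllib.request.urlopen(path).read()`; a loop with an attempt limit avoids unbounded recursion if the site keeps refusing requests.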

Function to scrape Traditional Chinese and English for a single talk:

In [5]:
def extract_talk(path, talk_name):

    langs = ["en", "zh-tw"]
    # Periods that do not follow a short title such as "Dr" or "Mrs"
    # (currently unused; only needed if the transcript is split into sentences)
    titles = r"(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    df = pd.DataFrame()

    for i in soup.find_all("link"):
        href = i.get("href")
        if href is None:
            continue
        # only look at the English and Traditional Chinese versions of the talk
        for lang in langs:
            # fetch each language at most once, even if several links point to it
            if lang in df.columns or href.find("?language={}".format(lang)) == -1:
                continue
            r1 = urllib.request.urlopen(href).read()
            soup1 = BeautifulSoup(r1, "lxml")
            text_talk = []
            # p (not i), so the outer loop variable is not overwritten
            for p in soup1.find_all("p", class_= "m-b:0"):
                if lang == "zh-tw":
                    # Chinese: drop tabs and newlines entirely
                    text_talk.append(p.text.strip().replace("\t", "").replace("\n", ""))
                else:
                    # English: turn newlines into spaces so words stay separated
                    text_talk.append(p.text.strip().replace("\t", "").replace("\n", " "))
            # one row per talk: the whole transcript joined into a single string
            text_talk = [" ".join(text_talk)]
            df1 = pd.DataFrame()
            df1[lang] = text_talk
            df = pd.concat([df1, df], axis = 1)
    df = pd.concat([pd.DataFrame({"Talk" : [talk_name]}), df], axis = 1)
    df.to_csv(talk_name + '.txt', index = False, sep='\t', encoding='utf-8')

One possible edge case (lacking perfectly aligned periods):

In [6]:
?urllib.request.urlopen

In [7]:
extract_talk("https://www.ted.com/talks/shubhendu_sharma_an_engineers_vision_for_tiny_forests_everywhere/transcript", "shubhendu_sharma_an_engineers_vision_for_tiny_forests_everywhere")
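
To illustrate why misaligned periods matter, here is a small sketch that splits an English string with the `titles` pattern from `extract_talk` and a Chinese string on `。`, then compares the sentence counts. The sample sentence pair and the helper names `split_english` and `split_chinese` are made up for illustration and are not taken from a real transcript.

```python
import re

# Same pattern as `titles` in extract_talk, written as a raw string:
# a period that does not follow a capitalized 2- or 3-letter token such as "Dr" or "Mrs".
TITLES = r"(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."

def split_english(text):
    """Split on sentence-ending periods, dropping empty pieces."""
    return [s.strip() for s in re.split(TITLES, text) if s.strip()]

def split_chinese(text):
    """Split on the full-width period, dropping empty pieces."""
    return [s.strip() for s in text.split("。") if s.strip()]

# Made-up sample pair: the translation merges the first two English sentences.
en = "Dr. Lee planted a forest. It grew quickly. Everyone was surprised."
zh = "李博士種了一片森林，它長得很快。大家都很驚訝。"

en_sentences = split_english(en)
zh_sentences = split_chinese(zh)
print(len(en_sentences), len(zh_sentences))
```

When the two counts differ, pairing sentences by position would misalign them from that point on, which is presumably why the function keeps each transcript as one joined string.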

In [8]:
alltalks = [x.replace("?language=zh-tw", "") for x in list(alltalks)]
alltalks[832]

'/talks/steve_ramirez_and_xu_liu_a_mouse_a_laser_beam_a_manipulated_memory'

Function to scrape Traditional Chinese and English for *all* TED talks:

I temporarily stopped at alltalks[833]; the call below restarts slightly earlier, at index 830.

In [None]:
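def to_csv(alltalks, talknum):
    try:
        for i in range(talknum, len(alltalks)):
            # alltalks[i][7:] strips the leading "/talks/" to get the output file name
            extract_talk('https://www.ted.com'+ alltalks[i] +'/transcript', alltalks[i][7:])
            time.sleep(3)
            # NOTE: the value printed is a fraction of 1 (0.3594 means 35.94%), despite the "%" label
            print("On talk number {}".format(talknum + 1) + ", {}% done".format(round((talknum + 1) / len(alltalks), 4)))
            talknum += 1
    except urllib.request.HTTPError:
        # On an HTTP error, wait and then resume from the talk that failed
        print("TED got mad at you, waiting 30 seconds")
        time.sleep(30)
        to_csv(alltalks, talknum)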

alltalks = [x.replace("?language=zh-tw", "") for x in list(alltalks)]
to_csv(alltalks, 830)

On talk number 831, 0.3594% done
On talk number 832, 0.3599% done
On talk number 833, 0.3603% done
On talk number 834, 0.3607% done
TED got mad at you, waiting 30 seconds
On talk number 835, 0.3612% done
TED got mad at you, waiting 30 seconds
On talk number 836, 0.3616% done
On talk number 837, 0.362% done
On talk number 838, 0.3625% done
TED got mad at you, waiting 30 seconds
On talk number 839, 0.3629% done
On talk number 840, 0.3633% done
On talk number 841, 0.3638% done
On talk number 842, 0.3642% done
On talk number 843, 0.3646% done
On talk number 844, 0.3651% done
TED got mad at you, waiting 30 seconds
On talk number 845, 0.3655% done
On talk number 846, 0.3659% done
On talk number 847, 0.3663% done
On talk number 848, 0.3668% done
On talk number 849, 0.3672% done
On talk number 850, 0.3676% done
TED got mad at you, waiting 30 seconds
On talk number 851, 0.3681% done
On talk number 852, 0.3685% done
On talk number 853, 0.3689% done
On talk number 854, 0.3694% done
On talk number

On talk number 1014, 0.4386% done
On talk number 1015, 0.439% done
On talk number 1016, 0.4394% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1017, 0.4399% done
On talk number 1018, 0.4403% done
On talk number 1019, 0.4407% done
On talk number 1020, 0.4412% done
On talk number 1021, 0.4416% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1022, 0.442% done
On talk number 1023, 0.4425% done
On talk number 1024, 0.4429% done
On talk number 1025, 0.4433% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1026, 0.4438% done
On talk number 1027, 0.4442% done
On talk number 1028, 0.4446% done
On talk number 1029, 0.4451% done
On talk number 1030, 0.4455% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1031, 0.4459% done
On talk number 1032, 0.4464% done
On talk number 1033, 0.4468% done
On talk nu

On talk number 1194, 0.5164% done
On talk number 1195, 0.5169% done
TED got mad at you, waiting 30 seconds
On talk number 1196, 0.5173% done
On talk number 1197, 0.5177% done
TED got mad at you, waiting 30 seconds
On talk number 1198, 0.5182% done
On talk number 1199, 0.5186% done
TED got mad at you, waiting 30 seconds
On talk number 1200, 0.519% done
TED got mad at you, waiting 30 seconds
On talk number 1201, 0.5195% done
On talk number 1202, 0.5199% done
On talk number 1203, 0.5203% done
On talk number 1204, 0.5208% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1205, 0.5212% done
On talk number 1206, 0.5216% done
On talk number 1207, 0.5221% done
On talk number 1208, 0.5225% done
On talk number 1209, 0.5229% done
TED got mad at you, waiting 30 seconds
On talk number 1210, 0.5234% done
On talk number 1211, 0.5238% done
On talk number 1212, 0.5242% done
On talk number 1213, 0.5247% done
On talk number 1214, 0.5251% done
On talk number
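
Since the scrape was stopped and restarted by hand, one possible way to resume is to skip any talk whose output file already exists rather than tracking an index. The sketch below assumes the `<talk_name>.txt` naming used by `extract_talk`; the helper name `pending_talks` and the sample talk paths are hypothetical.

```python
import os
import tempfile

def pending_talks(talk_paths, out_dir):
    """Return the talk paths whose <name>.txt file is not yet in out_dir."""
    todo = []
    for talk_path in talk_paths:
        name = talk_path[len("/talks/"):]  # same slice as alltalks[i][7:]
        if not os.path.exists(os.path.join(out_dir, name + ".txt")):
            todo.append(talk_path)
    return todo

# Demo in a temporary folder: pretend talk_a was already scraped.
with tempfile.TemporaryDirectory() as out_dir:
    open(os.path.join(out_dir, "talk_a.txt"), "w").close()
    remaining = pending_talks(["/talks/talk_a", "/talks/talk_b"], out_dir)
print(remaining)
```

Passing the filtered list to `to_csv` with a start index of 0 would then only request the talks that are still missing.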

## Reading in all txt files

In [94]:
glob.glob("*.csv")[2]

'mehdi_ordikhani_seyedlar_what_happens_in_your_brain_when_you_pay_attention.csv'

In [22]:
#library computer
path = "C:\\Users\\liblabs-user\\Desktop\\Cal Poly Summer Research 2017"
os.chdir(path)

#sierra's computer
#spath = "/Users/sierra/Desktop/Cal Poly Summer Research 2017"
#os.chdir(spath)

pd.read_csv(glob.glob("*.txt")[0], sep = "\t", encoding = "utf-8")

| | Talk | en | zh-tw |
|---|---|---|---|
| 0 | abigail_marsh_why_some_people_are_more_altruis... | There's a man out there, somewhere, who looks ... | 一位男子站在那，長的有點神似演員伊卓瑞斯·艾巴，或者是艾巴20年前的樣子。除了他鋌而走險救了... |
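
The cell above reads only the first file. A minimal sketch of reading every tab-separated file in a folder into one DataFrame could look like the following; the function name `read_all_talks` and the two sample rows written to a temporary folder are assumptions for illustration.

```python
import glob
import os
import tempfile

import pandas as pd

def read_all_talks(folder):
    """Read every *.txt file in folder (tab-separated, UTF-8) into one DataFrame."""
    files = sorted(glob.glob(os.path.join(folder, "*.txt")))
    frames = [pd.read_csv(f, sep="\t", encoding="utf-8") for f in files]
    # pd.concat raises ValueError if no files were found
    return pd.concat(frames, ignore_index=True, sort=False)

# Demo with two made-up talks written in the same layout extract_talk produces.
with tempfile.TemporaryDirectory() as folder:
    for name, en, zh in [("talk_a", "Hello.", "你好。"), ("talk_b", "Thanks.", "謝謝。")]:
        pd.DataFrame({"Talk": [name], "en": [en], "zh-tw": [zh]}).to_csv(
            os.path.join(folder, name + ".txt"), index=False, sep="\t", encoding="utf-8")
    talks = read_all_talks(folder)
print(talks.shape)
```

A talk that is missing one language would simply get a missing value in that column after the concatenation.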
