## Possible sources of data

- Scrape TED Talks

### Scraping TED Talks

We begin with the standard imports:

In [1]:
import requests, urllib.request, urllib.error
import time, os, glob
import re
import pandas as pd, numpy as np
from bs4 import BeautifulSoup
from itertools import chain

Function that fetches a *single* TED listing page and collects every talk link on it:

In [2]:
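def get_names(path, alltalks):
    # Download one listing page and parse its HTML.
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    talks = soup.find_all("a", class_ = "")
    for i in talks:
        # Anchors without an href are skipped instead of raising KeyError.
        href = i.attrs.get('href', '')
        # Keep only links that start with "/talks/"; the dict keys deduplicate repeats.
        if href.startswith('/talks/'):
            alltalks[href] = 1

    return alltalks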
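
The filtering step in `get_names` can be tried without touching the network. The sketch below is a dependency-free illustration that swaps BeautifulSoup for the standard library's `html.parser`; the class name `TalkLinkParser` and the sample HTML are hypothetical and only mimic the shape of the links the function keeps.

```python
from html.parser import HTMLParser


class TalkLinkParser(HTMLParser):
    """Collect hrefs that start with "/talks/" (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = {}  # dict keys deduplicate links, as in get_names

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/talks/"):
            self.links[href] = 1


# Made-up markup for illustration; not copied from ted.com.
sample_html = """
<a href="/talks/example_talk_one?language=zh-tw">One</a>
<a href="/talks/example_talk_one?language=zh-tw">One again</a>
<a href="/playlists/123">Playlist</a>
<a href="/talks/example_talk_two?language=zh-tw">Two</a>
<a>No href</a>
"""

parser = TalkLinkParser()
parser.feed(sample_html)
print(list(parser.links))
```

The duplicate link, the playlist link, and the anchor without an `href` are all dropped, leaving the two talk links in page order.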

Function to scrape all TED talk links:

In [3]:
alltalks = {}
link = "https://www.ted.com/talks?language=zh-tw&page={}&sort=newest"

In [4]:
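def get_talks(alltalks):
    try:
        # Each listing page adds 36 talks, so resume from the page implied by
        # the number of links collected so far.
        for i in range(len(alltalks) // 36, 66):
            path = link.format(i)
            # get_names updates alltalks in place and also returns it.
            alltalks = get_names(path, alltalks)
            print(path, len(alltalks))
            time.sleep(4)
    except urllib.request.HTTPError:
        # On an HTTP error, wait and then resume from the last complete page.
        print("TED got mad at you, waiting 30 seconds")
        time.sleep(30)
        get_talks(alltalks)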

get_talks(alltalks)

https://www.ted.com/talks?language=zh-tw&page=0&sort=newest 36
https://www.ted.com/talks?language=zh-tw&page=1&sort=newest 36
https://www.ted.com/talks?language=zh-tw&page=2&sort=newest 72
https://www.ted.com/talks?language=zh-tw&page=3&sort=newest 108
https://www.ted.com/talks?language=zh-tw&page=4&sort=newest 144
https://www.ted.com/talks?language=zh-tw&page=5&sort=newest 180
https://www.ted.com/talks?language=zh-tw&page=6&sort=newest 216
https://www.ted.com/talks?language=zh-tw&page=7&sort=newest 252
https://www.ted.com/talks?language=zh-tw&page=8&sort=newest 288
https://www.ted.com/talks?language=zh-tw&page=9&sort=newest 324
https://www.ted.com/talks?language=zh-tw&page=10&sort=newest 360
https://www.ted.com/talks?language=zh-tw&page=11&sort=newest 396
https://www.ted.com/talks?language=zh-tw&page=12&sort=newest 432
https://www.ted.com/talks?language=zh-tw&page=13&sort=newest 468
https://www.ted.com/talks?language=zh-tw&page=14&sort=newest 504
https://www.ted.com/talks?language=zh-
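
`get_talks` retries by calling itself after an HTTP error, which recurses once per failure. One possible alternative is a loop with a capped number of attempts. The sketch below is only an illustration under stated assumptions: `fetch_with_retry`, `max_attempts`, and the injectable `sleep` argument are hypothetical names that do not appear in the notebook, and the 30-second wait simply mirrors the value used above.

```python
import time
import urllib.error


def fetch_with_retry(fetch, max_attempts=5, wait_seconds=30, sleep=time.sleep):
    """Call fetch() until it succeeds or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except urllib.error.HTTPError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(wait_seconds)


# Simulated fetch that fails twice before succeeding (no network involved).
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.HTTPError(
            "https://example.invalid", 429, "Too Many Requests", None, None
        )
    return "page body"


waits = []  # record the waits instead of actually sleeping
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

Passing `sleep=waits.append` keeps the demonstration instant; in a real run the default `time.sleep` would pause between attempts.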

Function to scrape Traditional Chinese and English for a single talk:

In [5]:
def extract_talk(path, talk_name):

    langs = ["en", "zh-tw"]
    # Matches a period that does not follow a title such as "Mr", "Dr" or "Mrs".
    # Only needed if the transcripts are split into sentences; currently unused.
    titles = r"(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    df = pd.DataFrame()

    # Scan the page's <link> tags for hrefs that point at a language version.
    for link_tag in soup.find_all("link"):
        href = link_tag.get("href")
        if href is None:
            continue
        for lang in langs:
            # Only follow the English and Traditional Chinese versions.
            if href.find("?language={}".format(lang)) == -1:
                continue
            r1 = urllib.request.urlopen(href).read()
            soup1 = BeautifulSoup(r1, "lxml")
            text_talk = []
            # A separate loop variable is used here so the outer <link> tag is not overwritten.
            for p in soup1.find_all("p", class_="m-b:0"):
                # Drop tabs; join wrapped lines directly for Chinese and with a space for English.
                newline = "" if lang == "zh-tw" else " "
                text_talk.append(p.text.strip().replace("\t", "").replace("\n", newline))
            # Store the whole transcript as one cell in a column named after the language.
            df1 = pd.DataFrame({lang: [" ".join(text_talk)]})
            df = pd.concat([df1, df], axis = 1)

    # Prepend the talk name and write one tab-separated file per talk.
    df = pd.concat([pd.DataFrame({"Talk" : [talk_name]}), df], axis = 1)
    df.to_csv(talk_name + '.txt', index = False, sep='\t', encoding='utf-8')

One possible edge case (the English and Chinese transcripts lack perfectly aligned periods):

In [6]:
?urllib.request.urlopen

In [7]:
extract_talk("https://www.ted.com/talks/shubhendu_sharma_an_engineers_vision_for_tiny_forests_everywhere/transcript", "shubhendu_sharma_an_engineers_vision_for_tiny_forests_everywhere")
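
To make the alignment problem concrete, the sketch below splits an English and a Chinese passage into sentences and compares the counts. It reuses the `titles` pattern from `extract_talk` for English and splits Chinese on `。`. The helper names and the sentence pair are hypothetical, written for illustration rather than taken from a transcript.

```python
import re

# Same pattern as `titles` in extract_talk: a period not preceded by a
# title-like token such as "Mr", "Dr" or "Mrs".
TITLES = r"(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."


def split_english(text):
    # Split on sentence-ending periods and drop empty fragments.
    return [s.strip() for s in re.split(TITLES, text) if s.strip()]


def split_chinese(text):
    # Split on the full-width Chinese period and drop empty fragments.
    return [s.strip() for s in text.split("。") if s.strip()]


# Made-up example pair: the translation merges two English sentences into one.
en = "Dr. Lee planted a forest. It grew quickly."
zh = "李博士種了一片森林，它長得很快。"

en_sentences = split_english(en)
zh_sentences = split_chinese(zh)
print(len(en_sentences), len(zh_sentences))
```

The English text yields two sentences and the Chinese text one, so pairing sentences by position would misalign them. This is one reason to keep each transcript as a single string, as `extract_talk` does.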

In [8]:
alltalks = [x.replace("?language=zh-tw", "") for x in list(alltalks)]
alltalks[832]

'/talks/steve_ramirez_and_xu_liu_a_mouse_a_laser_beam_a_manipulated_memory'

Function to scrape Traditional Chinese and English for *all* TED talks:

I temporarily stopped at alltalks[833], so the run below resumes from index 830.

In [10]:
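def to_csv(alltalks, talknum):
    try:
        for i in range(talknum, len(alltalks)):
            # alltalks[i] looks like "/talks/<name>"; [7:] drops the "/talks/" prefix to get the file name.
            extract_talk('https://www.ted.com'+ alltalks[i] +'/transcript', alltalks[i][7:])
            time.sleep(3)
            # Note: the value printed is a fraction between 0 and 1, despite the "%" label.
            print("On talk number {}".format(talknum + 1) + ", {}% done".format(round((talknum + 1) / len(alltalks), 4)))
            talknum += 1
    except urllib.request.HTTPError:
        # Wait, then resume from the talk that failed (talknum only advances after a success).
        print("TED got mad at you, waiting 30 seconds")
        time.sleep(30)
        to_csv(alltalks, talknum)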

alltalks = [x.replace("?language=zh-tw", "") for x in list(alltalks)]
to_csv(alltalks, 830)

On talk number 831, 0.3594% done
On talk number 832, 0.3599% done
On talk number 833, 0.3603% done
On talk number 834, 0.3607% done
TED got mad at you, waiting 30 seconds
On talk number 835, 0.3612% done
TED got mad at you, waiting 30 seconds
On talk number 836, 0.3616% done
On talk number 837, 0.362% done
On talk number 838, 0.3625% done
TED got mad at you, waiting 30 seconds
On talk number 839, 0.3629% done
On talk number 840, 0.3633% done
On talk number 841, 0.3638% done
On talk number 842, 0.3642% done
On talk number 843, 0.3646% done
On talk number 844, 0.3651% done
TED got mad at you, waiting 30 seconds
On talk number 845, 0.3655% done
On talk number 846, 0.3659% done
On talk number 847, 0.3663% done
On talk number 848, 0.3668% done
On talk number 849, 0.3672% done
On talk number 850, 0.3676% done
TED got mad at you, waiting 30 seconds
On talk number 851, 0.3681% done
On talk number 852, 0.3685% done
On talk number 853, 0.3689% done
On talk number 854, 0.3694% done
On talk number

On talk number 1014, 0.4386% done
On talk number 1015, 0.439% done
On talk number 1016, 0.4394% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1017, 0.4399% done
On talk number 1018, 0.4403% done
On talk number 1019, 0.4407% done
On talk number 1020, 0.4412% done
On talk number 1021, 0.4416% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1022, 0.442% done
On talk number 1023, 0.4425% done
On talk number 1024, 0.4429% done
On talk number 1025, 0.4433% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1026, 0.4438% done
On talk number 1027, 0.4442% done
On talk number 1028, 0.4446% done
On talk number 1029, 0.4451% done
On talk number 1030, 0.4455% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1031, 0.4459% done
On talk number 1032, 0.4464% done
On talk number 1033, 0.4468% done
On talk nu

On talk number 1194, 0.5164% done
On talk number 1195, 0.5169% done
TED got mad at you, waiting 30 seconds
On talk number 1196, 0.5173% done
On talk number 1197, 0.5177% done
TED got mad at you, waiting 30 seconds
On talk number 1198, 0.5182% done
On talk number 1199, 0.5186% done
TED got mad at you, waiting 30 seconds
On talk number 1200, 0.519% done
TED got mad at you, waiting 30 seconds
On talk number 1201, 0.5195% done
On talk number 1202, 0.5199% done
On talk number 1203, 0.5203% done
On talk number 1204, 0.5208% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1205, 0.5212% done
On talk number 1206, 0.5216% done
On talk number 1207, 0.5221% done
On talk number 1208, 0.5225% done
On talk number 1209, 0.5229% done
TED got mad at you, waiting 30 seconds
On talk number 1210, 0.5234% done
On talk number 1211, 0.5238% done
On talk number 1212, 0.5242% done
On talk number 1213, 0.5247% done
On talk number 1214, 0.5251% done
On talk number

On talk number 1372, 0.5934% done
On talk number 1373, 0.5939% done
TED got mad at you, waiting 30 seconds
On talk number 1374, 0.5943% done
On talk number 1375, 0.5947% done
On talk number 1376, 0.5952% done
On talk number 1377, 0.5956% done
On talk number 1378, 0.596% done
On talk number 1379, 0.5965% done
TED got mad at you, waiting 30 seconds
On talk number 1380, 0.5969% done
On talk number 1381, 0.5973% done
On talk number 1382, 0.5978% done
On talk number 1383, 0.5982% done
On talk number 1384, 0.5986% done
TED got mad at you, waiting 30 seconds
On talk number 1385, 0.599% done
TED got mad at you, waiting 30 seconds
On talk number 1386, 0.5995% done
On talk number 1387, 0.5999% done
On talk number 1388, 0.6003% done
On talk number 1389, 0.6008% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 1390, 0.6012% done
On talk number 1391, 0.6016% done
On talk number 1392, 0.6021% done
On talk number 1393, 0.6025% done
On talk number 1394,

On talk number 1552, 0.6713% done
On talk number 1553, 0.6717% done
On talk number 1554, 0.6721% done
TED got mad at you, waiting 30 seconds
On talk number 1555, 0.6726% done
On talk number 1556, 0.673% done
On talk number 1557, 0.6734% done
On talk number 1558, 0.6739% done
On talk number 1559, 0.6743% done
On talk number 1560, 0.6747% done
TED got mad at you, waiting 30 seconds
On talk number 1561, 0.6752% done
On talk number 1562, 0.6756% done
On talk number 1563, 0.676% done
On talk number 1564, 0.6765% done
TED got mad at you, waiting 30 seconds
On talk number 1565, 0.6769% done
On talk number 1566, 0.6773% done
On talk number 1567, 0.6778% done
On talk number 1568, 0.6782% done
TED got mad at you, waiting 30 seconds
On talk number 1569, 0.6786% done
On talk number 1570, 0.6791% done
On talk number 1571, 0.6795% done
TED got mad at you, waiting 30 seconds
On talk number 1572, 0.6799% done
TED got mad at you, waiting 30 seconds
On talk number 1573, 0.6804% done
On talk number 1574,

TED got mad at you, waiting 30 seconds
On talk number 1736, 0.7509% done
On talk number 1737, 0.7513% done
On talk number 1738, 0.7517% done
On talk number 1739, 0.7522% done
TED got mad at you, waiting 30 seconds
On talk number 1740, 0.7526% done
TED got mad at you, waiting 30 seconds
On talk number 1741, 0.753% done
On talk number 1742, 0.7535% done
On talk number 1743, 0.7539% done
On talk number 1744, 0.7543% done
On talk number 1745, 0.7548% done
On talk number 1746, 0.7552% done
TED got mad at you, waiting 30 seconds
On talk number 1747, 0.7556% done
On talk number 1748, 0.7561% done
On talk number 1749, 0.7565% done
TED got mad at you, waiting 30 seconds
On talk number 1750, 0.7569% done
TED got mad at you, waiting 30 seconds
On talk number 1751, 0.7574% done
On talk number 1752, 0.7578% done
On talk number 1753, 0.7582% done
On talk number 1754, 0.7587% done
On talk number 1755, 0.7591% done
TED got mad at you, waiting 30 seconds
On talk number 1756, 0.7595% done
On talk number

On talk number 1915, 0.8283% done
On talk number 1916, 0.8287% done
On talk number 1917, 0.8292% done
On talk number 1918, 0.8296% done
TED got mad at you, waiting 30 seconds
On talk number 1919, 0.83% done
On talk number 1920, 0.8304% done
TED got mad at you, waiting 30 seconds
On talk number 1921, 0.8309% done
On talk number 1922, 0.8313% done
On talk number 1923, 0.8317% done
TED got mad at you, waiting 30 seconds
On talk number 1924, 0.8322% done
On talk number 1925, 0.8326% done
On talk number 1926, 0.833% done
On talk number 1927, 0.8335% done
TED got mad at you, waiting 30 seconds
On talk number 1928, 0.8339% done
On talk number 1929, 0.8343% done
On talk number 1930, 0.8348% done
On talk number 1931, 0.8352% done
On talk number 1932, 0.8356% done
On talk number 1933, 0.8361% done
TED got mad at you, waiting 30 seconds
On talk number 1934, 0.8365% done
On talk number 1935, 0.8369% done
On talk number 1936, 0.8374% done
On talk number 1937, 0.8378% done
On talk number 1938, 0.838

On talk number 2103, 0.9096% done
On talk number 2104, 0.91% done
On talk number 2105, 0.9105% done
On talk number 2106, 0.9109% done
TED got mad at you, waiting 30 seconds
On talk number 2107, 0.9113% done
On talk number 2108, 0.9118% done
On talk number 2109, 0.9122% done
TED got mad at you, waiting 30 seconds
On talk number 2110, 0.9126% done
On talk number 2111, 0.9131% done
On talk number 2112, 0.9135% done
On talk number 2113, 0.9139% done
On talk number 2114, 0.9144% done
On talk number 2115, 0.9148% done
TED got mad at you, waiting 30 seconds
On talk number 2116, 0.9152% done
On talk number 2117, 0.9157% done
On talk number 2118, 0.9161% done
TED got mad at you, waiting 30 seconds
On talk number 2119, 0.9165% done
TED got mad at you, waiting 30 seconds
On talk number 2120, 0.917% done
On talk number 2121, 0.9174% done
On talk number 2122, 0.9178% done
On talk number 2123, 0.9183% done
TED got mad at you, waiting 30 seconds
On talk number 2124, 0.9187% done
TED got mad at you, w

On talk number 2287, 0.9892% done
TED got mad at you, waiting 30 seconds
On talk number 2288, 0.9896% done
On talk number 2289, 0.9901% done
On talk number 2290, 0.9905% done
On talk number 2291, 0.9909% done
On talk number 2292, 0.9913% done
TED got mad at you, waiting 30 seconds
On talk number 2293, 0.9918% done
On talk number 2294, 0.9922% done
On talk number 2295, 0.9926% done
On talk number 2296, 0.9931% done
On talk number 2297, 0.9935% done
On talk number 2298, 0.9939% done
TED got mad at you, waiting 30 seconds
On talk number 2299, 0.9944% done
TED got mad at you, waiting 30 seconds
On talk number 2300, 0.9948% done
On talk number 2301, 0.9952% done
On talk number 2302, 0.9957% done
On talk number 2303, 0.9961% done
TED got mad at you, waiting 30 seconds
On talk number 2304, 0.9965% done
TED got mad at you, waiting 30 seconds
On talk number 2305, 0.997% done
On talk number 2306, 0.9974% done
On talk number 2307, 0.9978% done
On talk number 2308, 0.9983% done
On talk number 2309

2312
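
The progress messages above print a fraction (for example `0.3594`) next to a `%` sign. A minimal sketch of a message that reports a true percentage follows; `progress_message` is a hypothetical helper, and the total of 2312 assumes the number shown above is the count of talks.

```python
def progress_message(done, total):
    # Multiply by 100 so the number matches the "%" label.
    return "On talk number {}, {:.2f}% done".format(done, 100 * done / total)


message = progress_message(831, 2312)
print(message)  # On talk number 831, 35.94% done
```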

## Reading in all txt files

In [94]:
glob.glob("*.csv")[2]

'mehdi_ordikhani_seyedlar_what_happens_in_your_brain_when_you_pay_attention.csv'

In [22]:
#library computer
path = "C:\\Users\\liblabs-user\\Desktop\\Cal Poly Summer Research 2017"
os.chdir(path)

#sierra's computer
#spath = "/Users/sierra/Desktop/Cal Poly Summer Research 2017"
#os.chdir(spath)

pd.read_csv(glob.glob("*.txt")[0], sep = "\t", encoding = "utf-8")

| Unnamed: 0 | Talk | en | zh-tw |
|---|---|---|---|
| 0 | abigail_marsh_why_some_people_are_more_altruis... | There's a man out there, somewhere, who looks ... | 一位男子站在那，長的有點神似演員伊卓瑞斯·艾巴，或者是艾巴20年前的樣子。除了他鋌而走險救了... |
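
The cell above reads only the first file. A sketch of reading every per-talk file in a folder follows. To stay self-contained it uses the standard library's `csv` module instead of pandas and writes two tiny made-up files to a temporary directory; `read_talk_files` and the sample rows are hypothetical. With pandas, the same idea would be a `pd.concat` over `pd.read_csv(f, sep="\t", encoding="utf-8")` for each file.

```python
import csv
import glob
import os
import tempfile


def read_talk_files(folder):
    """Read every tab-separated *.txt file in folder into a list of row dicts."""
    rows = []
    for filename in sorted(glob.glob(os.path.join(folder, "*.txt"))):
        with open(filename, encoding="utf-8", newline="") as handle:
            rows.extend(csv.DictReader(handle, delimiter="\t"))
    return rows


with tempfile.TemporaryDirectory() as folder:
    # Write two small stand-in files with the same columns as the scraper output.
    samples = [("talk_a", "Hello.", "你好。"), ("talk_b", "Thanks.", "謝謝。")]
    for name, en, zh in samples:
        target = os.path.join(folder, name + ".txt")
        with open(target, "w", encoding="utf-8", newline="") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(["Talk", "zh-tw", "en"])
            writer.writerow([name, zh, en])
    talks = read_talk_files(folder)

print([row["Talk"] for row in talks])
```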
