# Data Gathering

#### Charlie Liou

## Possible sources of data

- Scrape TED Talks (language model + translation model)
- Scrape Wikipedia good articles (language model)

### Scraping TED Talks

We begin with the standard imports:

In [1]:
import requests, urllib.request, urllib.error
import time, os, glob
import re
import pandas as pd, numpy as np
from bs4 import BeautifulSoup
from itertools import chain

Function to look at a *single* TED listing page and grab every talk link on it:

In [2]:
alltalks = {}

def get_names(path, alltalks):
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    talks = soup.find_all("a", class_ = "")
    for i in talks:
        # Anchors without an href are skipped; keep only links that start with /talks/
        href = i.get('href', '')
        if href.startswith('/talks/'):
            alltalks[href] = 1

    return alltalks
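
The filtering step can be tried without touching the network. The following is a minimal sketch, not the notebook's own code: `collect_talk_links` is a hypothetical helper that takes a plain list of `href` values (assumed to have been pulled from the page already) and keeps the first occurrence of each `/talks/` link. A dict is used so that insertion order is preserved, which matters because later cells index talks by position.

```python
def collect_talk_links(hrefs, seen=None):
    """Add every href that starts with /talks/ to `seen`, keeping first-seen order."""
    seen = {} if seen is None else seen
    for href in hrefs:
        # href may be None when an anchor tag has no href attribute
        if href and href.startswith("/talks/"):
            seen.setdefault(href, 1)
    return seen


sample = ["/talks/a?language=zh-cn", None, "/about", "/talks/a?language=zh-cn", "/talks/b"]
links = collect_talk_links(sample)
print(list(links))
```

Passing the same dict back in on the next page accumulates links across pages, as `get_names` does with `alltalks`.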

Function to scrape all TED talk links:

In [3]:
#zlink = "https://www.ted.com/talks?language=zh-tw&page={}&sort=newest"
clink = "https://www.ted.com/talks?language=zh-cn&page={}&sort=newest"

In [4]:
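def get_talks(alltalks, link):
    try:
        # Each listing page holds 36 talks, so resume from the page implied
        # by the number of links collected so far.
        for i in range(len(alltalks) // 36, 66):
            path = link.format(i)
            alltalks = get_names(path, alltalks)
            print(path, len(alltalks))
            time.sleep(4)
    except urllib.request.HTTPError:
        # Back off, then pick up where we left off.
        print("TED got mad at you, waiting 30 seconds")
        time.sleep(30)
        get_talks(alltalks, link)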

#get_talks(alltalks, zlink)
get_talks(alltalks, clink)

https://www.ted.com/talks?language=zh-cn&page=0&sort=newest 36
https://www.ted.com/talks?language=zh-cn&page=1&sort=newest 36
https://www.ted.com/talks?language=zh-cn&page=2&sort=newest 72
https://www.ted.com/talks?language=zh-cn&page=3&sort=newest 108
https://www.ted.com/talks?language=zh-cn&page=4&sort=newest 144
https://www.ted.com/talks?language=zh-cn&page=5&sort=newest 180
https://www.ted.com/talks?language=zh-cn&page=6&sort=newest 216
https://www.ted.com/talks?language=zh-cn&page=7&sort=newest 252
https://www.ted.com/talks?language=zh-cn&page=8&sort=newest 288
https://www.ted.com/talks?language=zh-cn&page=9&sort=newest 324
https://www.ted.com/talks?language=zh-cn&page=10&sort=newest 360
https://www.ted.com/talks?language=zh-cn&page=11&sort=newest 396
https://www.ted.com/talks?language=zh-cn&page=12&sort=newest 432
https://www.ted.com/talks?language=zh-cn&page=13&sort=newest 468
https://www.ted.com/talks?language=zh-cn&page=14&sort=newest 504
https://www.ted.com/talks?language=zh-
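
The retry-on-`HTTPError` pattern above recurses on every failure, so a long run of errors keeps deepening the call stack. One possible alternative is a bounded retry loop. This is a sketch under assumptions: `fetch_with_retry`, `max_attempts`, and the injectable `sleep` argument are illustrative names, not part of the original notebook.

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, max_attempts=5, wait_seconds=30, sleep=time.sleep):
    """Call fetch(url); on HTTPError wait and try again, up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            print("Request failed, waiting {} seconds".format(wait_seconds))
            sleep(wait_seconds)
```

In this notebook it could wrap a single page fetch, for example `fetch_with_retry(lambda u: get_names(u, alltalks), path)`.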

Function to scrape the Chinese and English transcripts for a single talk (`ch` selects the Chinese variant, e.g. `zh-tw` or `zh-cn`):

In [5]:
def extract_talk(path, talk_name, ch):
    # Languages to collect: English plus the Chinese variant `ch`
    langs = ["en", ch]
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    df = pd.DataFrame()

    for link in soup.find_all("link"):
        href = link.get("href")
        if href is None:
            continue
        for lang in langs:
            # Only follow links to the transcript in one of the wanted languages
            if "?language={}".format(lang) in href:
                r1 = urllib.request.urlopen(href).read()
                soup1 = BeautifulSoup(r1, "lxml")
                text_talk = []
                for p in soup1.find_all("p", class_ = "m-b:0"):
                    text = p.text.strip().replace("\t", "")
                    # Chinese lines are joined directly; English lines need a space
                    sep = "" if lang == ch else " "
                    text_talk.append(text.replace("\n", sep))
                df1 = pd.DataFrame({lang: [" ".join(text_talk)]})
                df = pd.concat([df1, df], axis = 1)
    df = pd.concat([pd.DataFrame({"Talk" : [talk_name]}), df], axis = 1)
    df.to_csv(talk_name + '.txt', index = False, sep='\t', encoding='utf-8')
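
The language-dependent joining is the subtle part of this function: Chinese text has no spaces between words, so line breaks inside a paragraph are removed outright, while English line breaks become spaces. A small sketch of that rule on plain strings follows. `join_transcript` is a hypothetical helper, and unlike the cell above it also strips each line and drops empty ones.

```python
def join_transcript(paragraphs, is_chinese):
    """Flatten transcript paragraphs into a single string."""
    sep = "" if is_chinese else " "
    cleaned = []
    for p in paragraphs:
        # Remove tabs, then split on newlines and drop empty fragments
        lines = [line.strip() for line in p.replace("\t", "").split("\n")]
        cleaned.append(sep.join(line for line in lines if line))
    # Paragraphs themselves are separated by a single space, as in extract_talk
    return " ".join(cleaned)


english = join_transcript(["Hello\n\tworld", "Bye"], is_chinese=False)
chinese = join_transcript(["你好\n世界"], is_chinese=True)
print(english)
print(chinese)
```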

In [6]:
#alltalks = [x.replace("?language=zh-tw", "") for x in list(alltalks)]
alltalks = [x.replace("?language=zh-cn", "") for x in list(alltalks)]
alltalks[832]

'/talks/roberto_d_angelo_francesca_fedeli_in_our_baby_s_illness_a_life_lesson'

Function to scrape the Chinese and English transcripts for *all* TED talks:

I temporarily stopped at `alltalks[833]`.
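
Instead of tracking the stopping index by hand, the resume point could be derived from the files already written. The sketch below assumes one `<talk name>.txt` file per talk in the output directory, named the way `extract_talk` names them (the path with the leading `/talks/` removed). `first_missing_index` is a hypothetical helper.

```python
import os


def first_missing_index(talk_paths, out_dir):
    """Return the index of the first talk whose output file does not exist yet."""
    for idx, path in enumerate(talk_paths):
        # "/talks/some_name" -> "some_name.txt"
        name = path[len("/talks/"):] + ".txt"
        if not os.path.exists(os.path.join(out_dir, name)):
            return idx
    return len(talk_paths)
```

The result could then be passed as the starting talk number, for example `to_csv(alltalks, first_missing_index(alltalks, os.getcwd()), "zh-cn")`.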

In [7]:
#requests.get("https://www.ted.com" + alltalks[601] + "/transcript")

In [8]:
#urllib.request.urlopen("https://www.ted.com" + alltalks[601] + "/transcript").read()

In [9]:
def to_csv(alltalks, talknum, ch):
    try:
        for i in range(talknum, len(alltalks)):
            # alltalks[i][7:] drops the leading "/talks/" to get the file name
            extract_talk('https://www.ted.com'+ alltalks[i] +'/transcript', alltalks[i][7:], ch)
            time.sleep(3)
            print("On talk number {}".format(talknum + 1) + ", {}% done".format(100 * round((talknum + 1) / len(alltalks), 4)))
            talknum += 1
    except urllib.request.HTTPError:
        # talknum is the first talk not yet saved, so retry from there
        print("TED got mad at you, waiting 30 seconds")
        time.sleep(30)
        to_csv(alltalks, talknum, ch)

os.chdir("/Users/csmuser/Desktop/Cal Poly Summer Research 2017/data/zhcn_TED")
        
to_csv(alltalks, 2224, "zh-cn")

On talk number 2225, 96.82% done
On talk number 2226, 96.87% done
On talk number 2227, 96.91% done
On talk number 2228, 96.95% done
TED got mad at you, waiting 30 seconds
On talk number 2229, 97.0% done
On talk number 2230, 97.04% done
On talk number 2231, 97.08% done
On talk number 2232, 97.13000000000001% done
On talk number 2233, 97.17% done
On talk number 2234, 97.21% done
TED got mad at you, waiting 30 seconds
TED got mad at you, waiting 30 seconds
On talk number 2235, 97.26% done
On talk number 2236, 97.3% done
On talk number 2237, 97.35000000000001% done
On talk number 2238, 97.39% done
On talk number 2239, 97.43% done
TED got mad at you, waiting 30 seconds
On talk number 2240, 97.48% done
On talk number 2241, 97.52% done
On talk number 2242, 97.56% done
On talk number 2243, 97.61% done
On talk number 2244, 97.65% done
On talk number 2245, 97.69% done
TED got mad at you, waiting 30 seconds
On talk number 2246, 97.74000000000001% done
TED got mad at you, waiting 30 seconds
On tal
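
Some of the percentages above print as `97.13000000000001` because the fraction is rounded first and then multiplied by 100, which reintroduces floating-point noise. A possible fix is to let the format specifier do the rounding. `progress_line` below is a hypothetical helper, not part of the original cell.

```python
def progress_line(done, total):
    """Format a progress message with the percentage fixed to two decimals."""
    return "On talk number {}, {:.2f}% done".format(done, 100 * done / total)


print(progress_line(1, 3))
print(progress_line(1, 4))
```

Note that this always shows two decimals, so a value the original code prints as `97.0%` would appear as `97.00%`.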

### Scraping Wikipedia good articles

Function for getting all good article links:

In [4]:
def get_good_links():
    path, k = "https://en.wikipedia.org/wiki/Wikipedia:Good_articles/all", []
    r = urllib.request.urlopen(path).read()
    soup = BeautifulSoup(r, "lxml")
    for i in soup.findAll():
        j = i.get("href")
        # Keep article links only: they start with /wiki/ and have no namespace colon
        if j is not None and j.startswith("/wiki/") and ":" not in j:
            k.append(j)
    return k

In [5]:
goodarticles = get_good_links()
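
The link filter can be checked on a handful of sample `href` values. The sketch below restates the condition used in `get_good_links` and adds order-preserving de-duplication, which the original does not do. Both `is_article_link` and `unique_article_links` are hypothetical names.

```python
def is_article_link(href):
    """True for /wiki/ links without a namespace prefix such as Wikipedia: or Help:."""
    return href is not None and href.startswith("/wiki/") and ":" not in href


def unique_article_links(hrefs):
    """Filter to article links and drop repeats, keeping first-seen order."""
    return list(dict.fromkeys(h for h in hrefs if is_article_link(h)))


sample = ["/wiki/Telengard", None, "/wiki/Help:Contents", "/wiki/Telengard", "/wiki/Q*bert"]
print(unique_article_links(sample))
```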

Function for getting main body text from one article:

In [6]:
def get_text(link, k):
    edge = {"\u2009": " ", "\xa0": " ", "&amp;": " & "}
    link = "https://en.wikipedia.org" + link
    r = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(r, "lxml")
    for i in soup.findAll("p"):
        #temp = remove_brackets(str(i))
        temp = str(i)
        if "\n" in temp:
            l = temp.split("\n")
            for j in l: k.append(j.strip())
        else:
            k.append(temp)
    for i in edge:
        for j in range(len(k)):
            if i in k[j]:
                k[j] = k[j].replace(i, edge[i])
    return [i.replace("  ", " ") for i in k if i.strip() != ""]
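
The character clean-up at the end of `get_text` can also be written as a standalone function, which makes it easy to test. This is a sketch with two deliberate differences from the cell above, both assumptions on my part rather than the author's method: `html.unescape` decodes every HTML entity (so `&amp;` becomes `&` rather than ` & `), and a regex collapses runs of spaces of any length instead of only one pair.

```python
import html
import re


def normalize_text(text):
    """Decode HTML entities and normalise unusual space characters."""
    text = html.unescape(text)  # e.g. "&amp;" -> "&"
    # Thin space and non-breaking space become ordinary spaces
    text = text.replace("\u2009", " ").replace("\xa0", " ")
    # Collapse runs of two or more spaces into one
    return re.sub(r" {2,}", " ", text).strip()


print(normalize_text("Tom &amp; Jerry\xa0ran  10\u2009km"))
```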

In [7]:
def scrape_wiki(wikitext, goodarticles, num):
    try:
        for i in range(num, len(goodarticles)):
            wikitext = get_text(goodarticles[i], [])
            #time.sleep(1)
            print("On article {}, {}% done".format(i, round(100 * (i / len(goodarticles)), 4)), goodarticles[i])
            num += 1
            wiki_to_text(wikitext, goodarticles[i])
        #return wikitext
    except urllib.request.URLError:
        print("Wikipedia got mad, waiting 30 seconds")
        time.sleep(30)
        scrape_wiki(wikitext, goodarticles, num)

In [11]:
def wiki_to_text(l, name):
    # "/" and "*" are not safe in file names, so replace them with "_"
    fname = name.replace("/wiki/", "").replace("/", "_").replace("*", "_") + ".txt"
    with open(fname, "w", encoding = "utf-8") as file:
        for i in l:
            file.write(i + "\n")

Edge case articles:

- Article 1263: title contains a slash
    - Solution: replace "/" with "_"
- Article 4155: hit the maximum recursion depth
    - Solution: change the regex to also match [sentence]:
- Article 10340: URLError
    - Solution: rerun (most likely too many articles were scraped in one go, from article 4155 to article 10340)
- Article 12434: multiple nested < > inside of []
    - Solution: skip
- Article 14485: multiple nesting: [NbOF<sub>5</sub>]
    - Solution: skip
- Article 18917: incomplete read
    - Solution: redo
- Article 21072: title contains a *
    - Solution: replace "*" with "_"
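
The slash and asterisk cases above were handled one character at a time as they surfaced. A more general sketch replaces every character that is commonly invalid in file names in a single pass. The character set here is an assumption (the characters Windows rejects), and `article_filename` is a hypothetical helper rather than the notebook's `wiki_to_text`.

```python
import re


def article_filename(link):
    """Turn a /wiki/ link into a file name, replacing unsafe characters with "_"."""
    name = link.replace("/wiki/", "", 1)
    # Replace \ / : * ? " < > | which are rejected by some file systems
    return re.sub(r'[\\/:*?"<>|]', "_", name) + ".txt"


print(article_filename("/wiki/Q*bert"))
```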

In [None]:
scrape_wiki([], goodarticles, 21072)

On article 21072, 83.3149% done /wiki/Q*bert
On article 21073, 83.3188% done /wiki/Skeet_Shoot
On article 21074, 83.3228% done /wiki/Space_Cavern
On article 21075, 83.3267% done /wiki/The_Staff_of_Karnath
On article 21076, 83.3307% done /wiki/Telengard
On article 21077, 83.3347% done /wiki/Tranz_Am
On article 21078, 83.3386% done /wiki/Underwurlde


In [52]:
#c = "http://www.bbc.com/zhongwen/trad"
#r = urllib.request.urlopen(c).read()
#soup = BeautifulSoup(r, "lxml")

In [53]:
#soup