## Web Scraping and Sentiment Analysis

Quick overview:

## BeautifulSoup

Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

* Python library for pulling data out of HTML and XML files
* Convert HTML and XML into navigable Python objects
* Objects are represented as nodes in a tree
    * Each node has zero or more children
    * Each node has at most one parent (only the root has none)

HTML example:

<html>
   <head>
      <title>Document Title</title>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>

<center><img src="https://www.tutorialspoint.com/beautiful_soup/images/html_document.jpg" width="400"/></center>
(source: tutorialspoint.com)
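
As a minimal sketch of how this tree can be navigated, the snippet below parses a cleaned-up copy of the HTML example and looks at one node's parent and another node's children. The variable names (`html_doc`, `child_tags`, `bold_text`) are illustrative, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# A cleaned-up copy of the HTML example (assumed to be the intended markup).
html_doc = """
<html>
   <head>
      <title>Document Title</title>
   </head>
   <body>
      <h1>Tutorialspoint Online Library</h1>
      <p><b>It's all Free</b></p>
   </body>
</html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Every tag has one parent; the parent of <title> is <head>.
title = soup.title
print(title.parent.name)

# .children also yields whitespace text nodes, so keep only Tag objects.
child_tags = [child.name for child in soup.body.children if isinstance(child, Tag)]
print(child_tags)

# .get_text() collects the text of a node and all of its descendants.
bold_text = soup.find("b").get_text()
print(bold_text)
```

Filtering on `Tag` is a design choice for readability here; without it, the newlines between tags show up as extra children.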

## Example: Web scraping the COPSS Awards Recipients

Collect information about the statisticians who received the COPSS Presidents' Award. Build a dataframe with three columns (`Year`, `Name`, `Institute`), for example (`1981`, `Peter J. Bickel`, `University of California, Berkeley`). Set `Year` as the index and use `Name` and `Institute` as the column names.

https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award

* **Step 1:** Go to the website and use Google Chrome > More Tools > Developer Tools to inspect the HTML code. All of the information we need is in `<li>...</li>` tags.
* **Step 2:** Find all these tags using `BeautifulSoup`. We want to select a subset of them.
* **Step 3:** Use the `.split` function to separate the year, name, and institute. The optional second argument `maxsplit` limits the number of splits: `.split(",", 1)` splits only at the first comma.
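
To make the effect of the second argument of `.split` concrete, here is a small sketch on a single entry. The string `entry` is an assumed example of the `year: name, institute` format, built from the values mentioned above.

```python
# One entry in the assumed "year: name, institute" format.
entry = "1981: Peter J. Bickel, University of California, Berkeley"

# Split once on ": " to separate the year from the rest.
year, rest = entry.split(": ", 1)

# maxsplit=1 splits only at the first comma, so commas inside the
# institute name are left alone.
name, institute = rest.split(",", 1)
institute = institute.strip()  # drop the space that followed the comma

print(year, name, institute, sep=" | ")

# Without maxsplit, the institute is broken into two pieces.
pieces = rest.split(",")
print(pieces)
```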

In [7]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [8]:
res = requests.get("https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award")

soup = BeautifulSoup(res.content, "html.parser")

lists = soup.find_all("li")

In [9]:
lists

[<li class="mw-list-item" id="n-mainpage-description"><a accesskey="z" href="/wiki/Main_Page" title="Visit the main page [z]"><span>Main page</span></a></li>,
 <li class="mw-list-item" id="n-contents"><a href="/wiki/Wikipedia:Contents" title="Guides to browsing Wikipedia"><span>Contents</span></a></li>,
 <li class="mw-list-item" id="n-currentevents"><a href="/wiki/Portal:Current_events" title="Articles related to current events"><span>Current events</span></a></li>,
 <li class="mw-list-item" id="n-randompage"><a accesskey="x" href="/wiki/Special:Random" title="Visit a randomly selected article [x]"><span>Random article</span></a></li>,
 <li class="mw-list-item" id="n-aboutsite"><a href="/wiki/Wikipedia:About" title="Learn about Wikipedia and how it works"><span>About Wikipedia</span></a></li>,
 <li class="mw-list-item" id="n-contactpage"><a href="//en.wikipedia.org/wiki/Wikipedia:Contact_us" title="How to contact Wikipedia"><span>Contact us</span></a></li>,
 <li class="mw-list-item" id

In [10]:
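char = []  # rows of [name, institute]

ind = []  # award years, used as the index

for li in lists:
    # Award entries look like "year: name, institute"
    chars = li.text.split(": ")
    if len(chars) >= 2:
        # Split only at the first comma so institute names keep their commas
        element = chars[1].split(",", 1)
        if len(element) >= 2:
            char.append([element[0], element[1].strip()])
            ind.append(chars[0])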

char = pd.DataFrame(char, index=ind, columns=["Name", "Institute"])

char.head()

|  | Name | Institute |
|---|---|---|
| 1981 | Peter J. Bickel | University of California, Berkeley |
| 1982 | Stephen Fienberg | Carnegie Mellon University |
| 1983 | Tze Leung Lai | Stanford University |
| 1984 | David V. Hinkley | University of California, Santa Barbara |
| 1985 | James O. Berger | Duke University |
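
The loop above keeps any `<li>` whose text contains `": "` followed by a comma, which can also let through unrelated list items. One possible way to select the subset more strictly is a regular expression that requires a four-digit year at the start. This is only a sketch: the sample strings in `li_texts` are hypothetical, and the pattern assumes every award entry has the form `year: name, institute`.

```python
import re

# Hypothetical <li> texts: two award entries plus unrelated list items.
li_texts = [
    "Main page",
    "1981: Peter J. Bickel, University of California, Berkeley",
    "1982: Stephen Fienberg, Carnegie Mellon University",
    "Note: see also, for example, the references",
]

# Assumed entry format: "<4-digit year>: <name>, <institute>".
entry_pattern = re.compile(r"^(\d{4}): ([^,]+),\s*(.+)$")

rows = []
for text in li_texts:
    match = entry_pattern.match(text.strip())
    if match:
        rows.append(match.groups())  # (year, name, institute)

for row in rows:
    print(row)
```

The last sample string would pass the `len(...) >= 2` checks of the loop, but it is rejected here because it does not start with a year.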


## For each of these statisticians, find their year of birth. Create another column in your dataframe showing this information.

* **Step 1:** Go to the website https://en.wikipedia.org/wiki/Peter_J._Bickel, use Google Chrome > More Tools > Developer Tools to inspect the HTML code.
* **Step 2:** Find the tag containing the year of birth and extract it using `BeautifulSoup`. We can use `soup.find(string='Born')` followed by `.find_next('td')`. To get the year of birth from the resulting string, we can split it on `19` and take the first two characters of the second element, which are the last two digits of the year.
* **Step 3:** To repeat this procedure for each statistician in your dataframe, build the URL with something like `'https://en.wikipedia.org/wiki/' + name`. Not every page contains a year of birth. If this information is missing for a particular statistician, set the value to `NA`.
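
As a rough sketch of the URL construction in Step 3, the helper below turns a name into a candidate page address. The function name `wiki_url` is hypothetical, and the approach assumes the page title equals the name with spaces replaced by underscores, which will not hold for every person.

```python
from urllib.parse import quote

BASE_URL = "https://en.wikipedia.org/wiki/"


def wiki_url(name):
    """Build a candidate Wikipedia URL from a person's name."""
    # Wikipedia titles use underscores instead of spaces; quote() escapes
    # any remaining characters that are not safe in a URL path.
    title = name.strip().replace(" ", "_")
    return BASE_URL + quote(title)


print(wiki_url("Peter J. Bickel"))
```

When the list entry already contains a link to the person's page, following that link is likely more reliable than guessing the title from the name.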

In [4]:
res = requests.get("https://en.wikipedia.org/wiki/Peter_J._Bickel")

soup = BeautifulSoup(res.content, "html.parser")

soup.find(string="Born").find_next("td").text

'1940 (age\xa082–83)Bucharest, Romania'
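
Splitting on `19` works for a string like this one, but it can pick the wrong piece when `19` appears earlier in the text, for example as a day of the month. A possible alternative, sketched here with a hypothetical helper `birth_year`, is to search for the first standalone four-digit year. It assumes birth years fall between 1800 and 2099.

```python
import re


def birth_year(born_text):
    """Return the first four-digit year in the text, or "NA" if none is found."""
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", born_text)
    return match.group(1) if match else "NA"


born = "1940 (age\xa082–83)Bucharest, Romania"
print(birth_year(born))

# A date whose day is 19 would confuse the split-based approach.
print(birth_year("January 19, 1940"))

# No year at all falls back to "NA".
print(birth_year("Bucharest, Romania"))
```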

In [5]:
res = requests.get("https://en.wikipedia.org/wiki/COPSS_Presidents%27_Award")

soup = BeautifulSoup(res.content, "html.parser")

lists = soup.find_all("li")

In [6]:
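char = []  # rows of [name, institute, year of birth]

ind = []  # award years, used as the index

for li in lists:
    row = []
    chars = li.text.split(": ")
    if len(chars) >= 2:
        element = chars[1].split(",", 1)
        if len(element) >= 2:
            row = [element[0], element[1].strip()]
            ind.append(chars[0])
            try:
                # The first link in the entry is assumed to point to the person's page
                link = li.find("a").get("href")
                res = requests.get("https://en.wikipedia.org" + link)
                page = BeautifulSoup(res.content, "html.parser")
                ths = page.find(string="Born")
                if ths is None:
                    row.append("NA")
                else:
                    strings = ths.find_next("td").text
                    # Keep "19" plus the two digits that follow its first occurrence
                    row.append("19" + strings.split("19")[1][0:2])
            except Exception:
                # Missing link, failed request, or no "19" in the text
                row.append("NA")
            char.append(row)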

char = pd.DataFrame(char, index=ind, columns=["Name", "Institute", "YOB"])

char.head()

|  | Name | Institute | YOB |
|---|---|---|---|
| 1981 | Peter J. Bickel | University of California, Berkeley | 1940 |
| 1982 | Stephen Fienberg | Carnegie Mellon University | 1942 |
| 1983 | Tze Leung Lai | Stanford University | 1945 |
| 1984 | David V. Hinkley | University of California, Santa Barbara | 1944 |
| 1985 | James O. Berger | Duke University | 1950 |
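
The lookup of the `Born` cell can also be written as a small function that checks each step explicitly instead of relying on a broad `try`/`except`. The sketch below runs on two hypothetical, heavily simplified HTML snippets rather than on live pages; the function name `born_cell_text` and the snippets are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def born_cell_text(html):
    """Return the text of the cell that follows "Born", or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    label = soup.find(string="Born")
    if label is None:
        return None
    cell = label.find_next("td")
    return cell.get_text() if cell is not None else None


# Hypothetical stand-ins for a page with and without an infobox.
page_with_infobox = """
<table class="infobox">
  <tr><th>Born</th><td>1940 (age 82-83)<br/>Bucharest, Romania</td></tr>
</table>
"""
page_without_infobox = "<p>No infobox on this page.</p>"

with_box = born_cell_text(page_with_infobox)
without_box = born_cell_text(page_without_infobox)
print(with_box)
print(without_box)
```

Returning `None` leaves it to the caller to decide how a missing value is recorded, for example as `NA` in the dataframe.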


## Example: Find the names of all characters in a book


* https://en.wikipedia.org/wiki/The_Dark_Forest

One way of doing it using very few bs4 commands:

In [11]:
res = requests.get("https://en.wikipedia.org/wiki/The_Dark_Forest")

soup = BeautifulSoup(res.content, "html.parser")

lists = soup.find_all("li")


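# Print every <li> with its position to find where the character list starts
for val, el in enumerate(lists):
    print("index: " + str(val) + "text: " + el.text)

chars = []

# Positions 63 to 86 are assumed to hold the character entries;
# they will shift whenever the page changes
for i in range(63, 87):
    char = lists[i].text.split(" – ")[0]
    chars.append(char)


print(chars)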


index: 0text: Main page
index: 1text: Contents
index: 2text: Current events
index: 3text: Random article
index: 4text: About Wikipedia
index: 5text: Contact us
index: 6text: Donate
index: 7text: Help
index: 8text: Learn to edit
index: 9text: Community portal
index: 10text: Recent changes
index: 11text: Upload file
index: 12text: Create account
index: 13text: Log in
index: 14text:  Create account
index: 15text:  Log in
index: 16text: Contributions
index: 17text: Talk
index: 18text: 

(Top)


index: 19text: 


1Plot




index: 20text: 


2Characters



Toggle Characters subsection





2.1Wallfacers (面壁者)






index: 21text: 


2.1Wallfacers (面壁者)




index: 22text: 


3Trilogy




index: 23text: 


4Videos




index: 24text: 


5See also




index: 25text: 


6References




index: 26text: 


7External links




index: 27text: العربية
index: 28text: Deutsch
index: 29text: Español
index: 30text: Français
index: 31text: 한국어
index: 32text: Italiano
index: 33text: Magyar
index: 34text: 日本語

Another way of doing it using more bs4 commands:

In [12]:
res = requests.get("https://en.wikipedia.org/wiki/The_Dark_Forest")

soup = BeautifulSoup(res.content, "html.parser")

character_section = soup.find(id="Characters").find_next('ul').find_all('li')

characters = []

for char in character_section:
    characters.append(char.text.split(" – ")[0])

print(characters)


['Ye Wenjie (叶文洁)', 'Mike Evans (麦克·伊文斯)', 'Wu Yue (吴岳)', 'Zhang Beihai (章北海)', 'Chang Weisi (常伟思)', 'George Fitzroy', 'Albert Ringier', 'Yang Jinwen (杨晋文)', 'Miao Fuquan（苗福全）– Shanxi coal boss; neighbor to Zhang and Yang', 'Zhang Yuanchao (张援朝)', 'Shi Qiang (史强), also nicknamed Da Shi (大史)', 'Shi Xiaoming (史晓明)', 'Kent (坎特)', 'Secretary General Say: Secretarial General of the United Nations, oversees the creation of the Wallfacer project and selects Luo Ji, gambling that his importance to Trisolaris might generate success for humanity. Attempts to create a Human Memorial Project to catalogue human culture, which is destroyed for being defeatist by the Planetary Defense Council.', 'Yamasugi Keiko (山杉恵子)', 'Garanin', 'Ding Yi (丁仪)', 'Zhuang Yan (庄颜)', 'Ben Jonathan', 'Dongfang Yanxu (东方延绪)', 'Major Xizi']
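
Some entries in this output still carry their description because the page does not always use `" – "` as the separator. One way to handle this, shown as a sketch only, is to split on a regular expression that accepts a dash with or without surrounding spaces, or a colon. The sample entries are taken from the output above, with the second one shortened, and the pattern is an assumption about which separators occur.

```python
import re

# Entries from the output above (the second one is shortened).
entries = [
    "Miao Fuquan（苗福全）– Shanxi coal boss; neighbor to Zhang and Yang",
    "Secretary General Say: Secretarial General of the United Nations",
    "Ye Wenjie (叶文洁)",
]

# Split at the first en dash (\u2013) or em dash (\u2014), with or without
# surrounding spaces, or at the first colon followed by whitespace.
separator = re.compile(r"\s*[\u2013\u2014]\s*|:\s+")

names = [separator.split(entry, maxsplit=1)[0].strip() for entry in entries]
print(names)
```

Plain hyphens are deliberately left out of the pattern so that hyphenated names are not cut in two.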
