<font color=darkred>

# Soc220: Computational Text Analysis
## Lab2: Webscraping With Beautiful Soup

<br>

![pip](https://www.pcs.org/assets/uploads/GE_Illustration_Large750.jpg)

***
    2/1/2018
    (Image: Pip and Magwitch)

# Solutions to homework:

In [None]:
primes = []
for j in range(2,101):
    #count up to each number
    for i in range(2,j):
        #check if there are any numbers that can divide
        if j % i == 0:
            #and if there are, stop
            break
    else:
        #since not, print
        primes.append(j)
        print(j)       

In [None]:
def isprime(j):
    '''
    Takes in a positive number and checks whether it is prime. Returns number if prime.
    
    Int -> Int
    '''
    for i in range(2,j):
        if j % i == 0:
            break     
    else:
        return(j)

In [None]:
isprime(19)

In [None]:
isprime(21)

In [None]:
# start at 2: isprime(1) returns 1, which would wrongly put 1 in the list
primes_to_100 = [k for k in range(2, 101) if isprime(k)]
primes_to_100

***
***

Today we're going to scrape some data from the court records of the Old Bailey, the criminal court of London, from the mid-17th century up until WW1. Thanks to the British National Library, every single criminal proceeding (transcripts, charges, verdicts, sentences, etc.) is freely accessible online. This week, we'll scrape from the front-end of the website; that is, we'll write some code to automatically load a webpage, find some content, and then save it. 

A few notes on webscraping:
- This is an extremely flexible way to collect data. If you can view it in a browser, then it's possible, with enough patience, to get that data.
- However, the way we teach a computer to get these data is to look through the HTML code for specific tags. If the web developer for a given website makes a slight change in that coding, made a mistake, or wrote inconsistent code, then you can be SOL in getting data, possibly creating bias in your collection process!
- In addition, we'll probably have to use 'regular expressions', a very old-school mini-language all unto itself for parsing strings. We'll use this to sort through the types of text that we want, which are held within specific HTML tags.

The new [firefox browser](https://www.mozilla.org/en-US/firefox/) is great for viewing html code on a website.

In [None]:
# First, we load libraries: 
######

import re
import requests
from urllib.request import urlopen

from bs4 import BeautifulSoup
import pickle
import json
import os




![obo_online](obo_online.png)


***
<font color=darkgreen>
    
Charles Dickens was once a court reporter at the Old Bailey and witnessing this case was the inspiration for the character of Magwitch, the plot twist of Great Expectations.


[Persecution of Thomas Knight](https://www.oldbaileyonline.org/browse.jsp?id=t18331017-6-off35&div=t18331017-6)

![knight](knight.png)

<font color=darkred>

## General strategy:

1. Find a list of links, each of which contains some text that we are looking for. In other contexts, this could be a list of speeches, a list of press releases, a list of publications, a list of novels even -- basically anything.
2. Identify a regular expression or HTML code tag that on each of those pages gives us the text in question that we want.
3. Extract that text, pause for a moment, then move on to the next page and rinse and repeat. (n.b. ALWAYS have a built-in pause in order to not alarm website admins. For those of you who decide to become black-hat hackers: a rapid stream of automated requests to a website looks just like a DDoS attack.)
4. We'll also want to include error-handling exceptions so that if one page has some issue (a null value, for instance), the entire scrape doesn't end, especially if we are planning to scrape over the course of 10 to 12 hours. We will want to record each failure as well.
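
The four steps above can be sketched as a single loop. This is a minimal sketch, not the code used later in this lab: `scrape_all`, `fake_fetch`, and the page names are hypothetical, and `fake_fetch` stands in for the real download-and-extract step so the example runs without touching a website.

```python
import time


def scrape_all(urls, fetch, pause_seconds=1.0):
    """Fetch every URL in turn, pausing between requests and recording failures.

    `fetch` is any function that takes a URL and returns the extracted text.
    """
    results = {}   # url -> extracted text
    failures = {}  # url -> error message, so nothing fails silently
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as error:
            # one bad page should not end a scrape that runs for hours
            failures[url] = repr(error)
        time.sleep(pause_seconds)  # built-in pause between requests
    return results, failures


def fake_fetch(url):
    """Stand-in for a real download; fails on one page on purpose."""
    if url.endswith("bad"):
        raise ValueError("no text found")
    return "text of " + url


results, failures = scrape_all(["page/1", "page/bad", "page/2"], fake_fetch, pause_seconds=0)
print(len(results), "pages scraped;", len(failures), "failed")
```

Keeping the failures in their own dictionary means you can re-run just those pages afterwards instead of restarting the whole scrape.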

<font color=darkgreen>
    
### Step 1: Figuring out the text data we want.

Here I wish to see all cases which involved Britons who were exiled (and then came back).



Next week, we'll use the full API. And we'll also discuss going to 'private' web pages.

https://www.oldbaileyonline.org/forms/formCustom.jsp

'The boxes below allow you to search the whole of the Proceedings and all published Ordinary of Newgate's Accounts (for the period from 1679 to 1772)'

Search for those tried for 'returning from transportation' (same sentence as Magwitch):

https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&start=0&count=0

So, from this angle, we'll get a piece of what we want, but next week, with APIs, we'll dig a little deeper.

Let's start with this case, someone convicted of stealing a handkerchief:

https://www.oldbaileyonline.org/browse.jsp?id=t17940115-36-off187&div=t17940115-36

In [None]:
# we first take the URL 
url = "https://www.oldbaileyonline.org/browse.jsp?id=t17940115-36-off187&div=t17940115-36"
# and then we request it, as if we were loading an individual web page.
req = requests.get(url,timeout=20) #timeout: give up if the server has not answered within 20 seconds
#n.b. a timeout is not a pause; when looping over many pages, also sleep between requests.
#For those curious, a DDoS attack is essentially a flood of page requests like this one.
req.status_code #200 means we have gotten it correctly

In [None]:
req.html = req.text #extract the text from the request
soup = BeautifulSoup(req.html,"html.parser") #parse the text so the computer can read the html tags
print(soup.prettify()[2500:3000]) #print out what the soup looks like

![html_code_structure](http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png)

<font color=darkgreen>
    
**Essentially what we're doing is looking for that last row down here, in order to find the relevant text.**

<font color=darkgreen>

### **Html code structure**

- Upon request, HTML is parsed into the DOM (Document Object Model).
- All the content we want is generally at the bottom of this tree.
- The 'art' / frustrating part of webscraping is identifying which piece of the tree to navigate down.
- The breakpoints for a webscrape are where this tree has some inconsistency (null data or things structured differently).
- So again, if you can view it in a browser, then you can scrape it. If there's any break in this structure, then your code is going to fail b/c it relies on this structure to navigate for data.


### Navigating this structure:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree

1. Find a given tag
2. Find a tag and ask for its children
3. Find a tag and ask for parents
4. Ask for siblings (navigate sideways)

Otherwise:

1. Search for an expression `soup.find_all('b')`
2. Search using a regular expression. `re.compile("t")`
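
Here is a small sketch of each of those moves. The HTML string is made up for illustration and is not the Old Bailey's actual markup; the calls themselves (`find`, `find_all`, `.parent`, `find_next_sibling`) are standard BeautifulSoup methods.

```python
import re
from bs4 import BeautifulSoup

# a made-up page, not the Old Bailey's real markup
html = """
<div class="trial">
<h1>Title</h1><p>First paragraph.</p><p>Second <b>bold</b> paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. find a given tag
div = soup.find("div")
# 2. ask for its direct children (recursive=False stops at one level down)
child_names = [child.name for child in div.find_all(recursive=False)]
# 3. ask a tag for its parent
bold_parent = soup.find("b").parent.name
# 4. navigate sideways to the next sibling tag
next_text = soup.find("h1").find_next_sibling().get_text()

# searching instead of navigating: by tag name...
bold_tags = soup.find_all("b")
# ...or by a regular expression on the tag name (names starting with p or b)
p_or_b = [tag.name for tag in soup.find_all(re.compile("^[pb]"))]

print(child_names, bold_parent, next_text, p_or_b)
```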

<font color=darkgreen>

### Regular expressions

- Functionally, another distinct language
- Raw method of parsing strings

Regular expression cheat sheet:

https://regexr.com

Datacamp tutorial:

https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial

Python docs:

https://docs.python.org/2/library/re.html

Table:

| Expression | Search                         |
|------------|--------------------------------|
| \d         | Any Digit                      |
| \D         | Any Non-digit character        |
| .          | Any Character                  |
| \.         | Period                         |
| [abc]      | Only a, b, or c                |
| [^abc]     | Not a, b, nor c                |
| [a-z]      | Characters a to z              |
| [0-9]      | Numbers 0 to 9                 |
| \w         | Any Alphanumeric character     |
| \W         | Any Non-alphanumeric character |
| {m}        | m Repetitions                  |
| {m,n}      | m to n Repetitions             |
| *          | Zero or more repetitions       |
| +          | One or more repetitions        |
| ?          | Zero or one repetition         |
| \s         | Any Whitespace                 |
| \S         | Any Non-whitespace character   |
| ^…$        | Starts and ends                |
| (a(bc))    | Capture Sub-group              |

https://github.com/zeeshanu/learn-regex
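
A few rows of the table can be tried out with Python's built-in `re` module before we move to pandas. This is only a sketch; the sample line is borrowed from the class schedule used in the next cells.

```python
import re

line = "Wednesday: Class is at 2:00pm. Review papers!"

# \d+ : one or more digits in a row
digits = re.findall(r"\d+", line)

# (\d?\d):(\d\d) ?([ap]m) : hour, minute, and am/pm as three capture groups
time_match = re.search(r"(\d?\d):(\d\d) ?([ap]m)", line)

# ^\w+ : alphanumeric characters anchored to the start of the string
first_word = re.match(r"^\w+", line).group()

# \. needs the backslash; a bare . would match any character
periods = re.findall(r"\.", line)

print(digits)
print(time_match.groups())
print(first_word)
print(len(periods))
```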

In [None]:
import pandas as pd

#Let's imagine we have a series of webpages with this text on it, a bit of 
#a class schedule

class_sentences = ["Monday: Homework is due at 11:59am",
                  "Wednesday: Class is at 2:00pm. Review papers!",
                  "Thursday: Lab is at 5:00pm"]

df = pd.DataFrame(class_sentences, columns=['text'])
df

In [None]:
# select dataframe, the column 'text', call str, and then use the len
# method on that string
# find number of characters
df['text'].str.len()

# n.b. if you wish to convert something to a string, wrap it in a str() function

<font color=darkred>

Is that the number of words or the number of characters?

In [None]:
#tokens for each string 
# call the split function to break up on white spaces, and then count len of 
#those split up strings
df['text'].str.split().str.len()

In [None]:
#search through the text, use string method of contains to find 'Lab'
df['text'].str.contains('Lab')

In [None]:
# regular expressions begin w/ an r before the string's quotes (a raw string)
# \d looks for digits
df['text'].str.contains(r'\d')

In [None]:
# group and find digits
df['text'].str.findall(r'(\d?\d):(\d?\d)')

In [None]:
#find all 'is'
df['text'].str.extractall(r'(is)')

In [None]:
#one or two digits, ':', two digits, an optional space, then [ap] followed by m (am or pm)
df['text'].str.extractall(r'((\d?\d):(\d\d) ?([ap]m))')

<font color=darkgreen>

### Open developer tools. Tools --> Toggle Developer Tools

https://www.oldbaileyonline.org/browse.jsp?id=t17940115-36-off187&div=t17940115-36

In [None]:
soup = BeautifulSoup(req.html,"html.parser") #n.b. different parsers can mean very different structures
print(soup.prettify()[9500:10500])

In [None]:
# first 20 tags
[tag.name for tag in soup.find_all()][:20]

In [None]:
# the head, is there anything useful up here?
soup.head

In [None]:
#soup.body

In [None]:
# let's find all paragraphs of text in the body of the webpage.
soup.body.find_all('p')

In [None]:
# The result of this search is a bs4 ResultSet, which behaves like a Python list
type(soup.body.find_all('p'))

In [None]:
# we have to use the .text or .string attribute to extract the actual string or text that we want
for p_tag in soup.find_all('p'):
    print(p_tag.text)

<font color=darkgreen>
    
However, I get all this crap at the end that I don't want to have in there. Plus I also want to extract some unique values: name of the defendant, date, their sentence, and then the text of the trial. Right now this is all in one big blob.

Let's say I wanted to create this Python data structure (next time we'll discuss storing this externally):

| Defendant Name | Date of Trial | Sentence | Text of Trial |
|----------------|---------------|----------|---------------|
| Joshua Daniels  | 1/15/1794      | Death    |  JOHN OWEN sworn. I am a servant to Mr. Kirby. The prisoner was tried here in September sessions last, for stealing an handkerchief, and was convicted, and received sentence to be transported....  |

<font color=darkgreen>
    
### Step 2. Parse the html with BeautifulSoup, that is, turn the raw text into something that the machine can read through.

BeautifulSoup has a few crucial functions for us:

- `soup.prettify()`: Returns a cleaned-up version of the raw HTML for printing
- `soup.find_all()`: Returns a list-like `ResultSet` of all matching objects
- `soup.find()`: Returns the first matching object, or `None` if there is no match
- `soup.text` / `soup.get_text()`: Returns the visible text of an object (e.g., `<p>Some text</p>` -> "Some text")


Full documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

#### Right-click, 'View source'

<font color=darkgreen>
    
So, let's go look at the developer tools on each page and see if we can find a structure that works. First the title:

![](title.png)

In [None]:
# first the title
soup.find('div', class_='sessionsPaperTitle')

In [None]:
# extract just the text of the title
soup.find('div', class_='sessionsPaperTitle').get_text()

<font color=darkgreen>
    
**Next, just the text of the trial.**

And then next, the body, with the text in question:

![](body.png)

[//]: ![body_trail](body_detail.png)

What we see here is that we want all the paragraphs within "div sessions paper" tag.

In [None]:
soup.find('div', class_='sessionsPaper')

In [None]:
# just the paragraphs in the div class='sessionsPaper' tag
soup.find('div', class_='sessionsPaper').find_all('p')

In [None]:
#for p in soup.find('div', class_='sessionsPaper').find_all('p'):
#    #get text per tag from within.
#    print(p.get_text())

In [None]:
# so, if we wish to write all these paragraphs into a single entity, we'll take them as a list.

trial_text = [] #blank list, must be outside of the for loop
for p in soup.find('div', class_='sessionsPaper').find_all('p'):
    trial_text.append(p.get_text()) #append each paragraph to blank list

In [None]:
# sanity check, list of paragraphs from website.
trial_text[2:4]

<font color=darkgreen>
    
So, now we are going to take out a string with some identifying information in it and we're going to extract a list of paragraphs. At this point, you might be asking "Zach, what the hell, data in this form is useless to me!" Well, yes, we're going to start data wrangling next week. For right now, we just want to get it off the website and into a Python data object and save it locally.

Later in your text analysis career, you'll find it useful to integrate cleaning and scraping simultaneously. Right now, we're just going to focus on scraping.

**However, now we know what we want from each page: a single string which contains information about the trial (name of defendant and date of trial) and then a list of strings of the actual text of the case.**

<font color=dark>
#### A word of warning: XML VS HTML

- HTML tells your browser how to display information.
- XML is a structured markup document whose tags label which information is which.
- One can save things directly as XML if one has the local storage space.
- Instead of using an HTML parser, one can get the XML page and then parse that.

To view as XML:

https://www.oldbaileyonline.org/browse.jsp?foo=bar&path=sessionsPapers/17940115.xml&div=t17940115-36&xml=yes
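
To make the difference concrete, here is a sketch of parsing a labelled XML record with Python's built-in `xml.etree.ElementTree`. The XML below is invented for illustration (tag names, id, and defendant are all made up); the Old Bailey's real schema is different, so inspect the link above before writing a parser for it.

```python
import xml.etree.ElementTree as ET

# invented XML for illustration; the real Old Bailey schema differs
xml_text = """
<trial id="t-example">
  <defendant>Jane Doe</defendant>
  <date>1794-01-15</date>
  <verdict>guilty</verdict>
  <p>First paragraph of testimony.</p>
  <p>Second paragraph of testimony.</p>
</trial>
""".strip()

root = ET.fromstring(xml_text)

# because each piece of information has its own tag, no regex is needed
record = {
    "id": root.get("id"),                        # attribute on the root tag
    "defendant": root.findtext("defendant"),     # text of the first matching child
    "date": root.findtext("date"),
    "verdict": root.findtext("verdict"),
    "text": [p.text for p in root.findall("p")], # every paragraph, in order
}
print(record)
```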

***

<font color=darkgreen>
    
### 3. Get a list of links to the pages with the data.

https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&start=0&count=0

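#### Here we encounter an inconsistency in the web design: the first page of results uses 'start=0&count=0', while later pages use 'count=391&start=10', 'start=20', and so on. We can't build 'start=0' from the same pattern, so we have to 'hack it' and manually add that page at the end.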

https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&count=391&start=10

https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&count=391&start=20

In [None]:
# so here is the base url from our search
url_search = 'https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&count=391&start='

In [None]:
#we want to count from 10 to 390 in increments of 10 (the start=0 page gets added by hand below)
#range stops before its end value, so we need 400 in order to include 390
print(list(range(10,400,10)))

In [None]:
list_urls = []
for i in range(10,400,10):
    #append to our blank list the base url joined with the page offset
    list_urls.append(url_search+str(i))

In [None]:
list_urls[:5]

In [None]:
# Hack and add this in at the end.

list_urls.append('https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&start=0&count=0')

In [None]:
# Check the first URL, which is the second page.
list_urls[0]

In [None]:
# Check the last URL, which is the first page.
list_urls[-1]

<font color=darkgreen>

So, now we have a list of URLS which contain a list of the pages of interest. Now we have to extract urls of the pages of interest.

https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&start=0&count=0

In [None]:
url_list = "https://www.oldbaileyonline.org/search.jsp?gen=1&form=searchHomePage&_offences_offenceCategory_offenceSubcategory=miscellaneous_returnFromTransportation&start=0&count=0"
req = requests.get(url_list,timeout=20)
req.status_code #again check status

In [None]:
req.html = req.text
soup = BeautifulSoup(req.html,"html.parser")
#print(soup.prettify())

<font color=darkgreen>
    
Go to the search results page and then get the link to each trial page.

![search_results](search_results.png)

In [None]:
# now we use a regular expression on the href to select only those hrefs that contain the string '#highlight'
# each tag can be indexed like a dictionary to select one attribute, e.g. a_tag['href']
for a_tag in soup.find_all('a', href = re.compile('#highlight')):
    print(a_tag['href'])

<font color=darkgreen>
### **Now we extract the links to the unique cases.**

In [None]:
import time

#n.b. we gotta keep these outside for-loops
list_trial_urls = []
counter = 0

for url_ in list_urls:
    counter += 1
    
    #get url and soup the page
    req = requests.get(url_,timeout=20)
    req.html = req.text
    soup = BeautifulSoup(req.html,"html.parser")
    
    print('Page',counter,'of search results added.')
    
    #then get the trial urls and add them to big list of trial urls
    for a_tag in soup.find_all('a', href = re.compile('#highlight')):
        print('https://www.oldbaileyonline.org/'+a_tag['href'])
        list_trial_urls.append('https://www.oldbaileyonline.org/'+a_tag['href'])
    
    time.sleep(1) #the built-in pause between requests from our general strategy

In [None]:
#sanity check: should be 391
len(set(list_trial_urls))

<font color=darkgreen>
    
Check if list url is the last on page '39':

https://www.oldbaileyonline.org/browse.jsp?id=t18810502-481-offence-1&div=t18810502-481#highlight
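
The loop above glues the site root onto each `href` with `+`, and the sanity check uses `set()` to count unique links. A slightly more defensive sketch uses `urljoin` from the standard library (it copes with hrefs that do or don't start with a slash) and `dict.fromkeys` to drop repeats while keeping the original order. The hrefs below are illustrative, written in the style of the search results rather than copied from a real results page.

```python
from urllib.parse import urljoin

base = "https://www.oldbaileyonline.org/"

# illustrative hrefs in the style of the search results, with one repeat
hrefs = [
    "browse.jsp?id=t17940115-36-off187&div=t17940115-36#highlight",
    "browse.jsp?id=t18810502-481-offence-1&div=t18810502-481#highlight",
    "browse.jsp?id=t17940115-36-off187&div=t17940115-36#highlight",
]

# join each relative href onto the site root
full_urls = [urljoin(base, href) for href in hrefs]

# drop duplicates but keep first-seen order (a set would lose the order)
unique_urls = list(dict.fromkeys(full_urls))

print(len(full_urls), "links found,", len(unique_urls), "unique")
```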

<font color=darkgreen>
    
## 4. Last step! Now we loop through all the urls of trials, extract the information in question, and save them to two Py lists, and then we'll zip those into a dictionary and pickle it.

- We will want to include a 'Try-Except' clause as well in order to let the scrape continue if it hits any snags.

In [None]:
import time

#two empty lists, plus a record of any pages that failed
text_of_trials = []
title_of_trials = []
failed_urls = []

#change to an empty directory to save raw text files (create it if needed)
os.makedirs("data_dump", exist_ok=True)
os.chdir("data_dump")

for url in list_trial_urls:
    try:
        # get the soup from each trial page
        req = requests.get(url,timeout=20)
        soup = BeautifulSoup(req.text,"html.parser")
        
        #get the title of the trial
        title_of_trial = soup.find('div', class_='sessionsPaperTitle').get_text()
        
        #get the trial text: every paragraph inside the sessionsPaper div
        trial_text = [] #blank list, reset for each page
        for p in soup.find('div', class_='sessionsPaper').find_all('p'):
            trial_text.append(p.get_text()) #append each paragraph to blank list
        
        #write out the raw text, one file per trial
        with open(title_of_trial,'w') as f:
            f.write(str(trial_text))
    except Exception as e:
        #record the failure and keep going instead of ending the whole scrape
        failed_urls.append(url)
        print('ERROR on',url,':',e)
    else:
        #append both together so the two lists stay aligned for zipping later
        title_of_trials.append(title_of_trial)
        text_of_trials.append(trial_text)
        print('Processed trial:',len(title_of_trials))
    
    time.sleep(1) #built-in pause between requests

In [None]:
#data check
title_of_trials[20:25]

In [None]:
#data check
text_of_trials[20:25]

<font color=darkgreen>

### 5. Save as pickle object

- Pickle is a way to save Python objects on local disk.
- It's probably a good idea to pickle periodically during large scrapes, so that if there's an error you'll at least keep a chunk of the data.
- Next week, we'll discuss storing them in more human-friendly ways.
- For now, your goal should be to find a place with some interesting data and then scrape it.
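
Pickling while scraping might look something like the sketch below. Everything here is a stand-in: `save_checkpoint`, the file name, and the checkpoint interval of 10 are assumptions, and the loop fakes the scraped pages so the example runs on its own.

```python
import os
import pickle
import tempfile


def save_checkpoint(data, path):
    """Pickle `data` to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(data, f)
    # an interrupted write never clobbers the last good checkpoint
    os.replace(tmp_path, path)


checkpoint_path = os.path.join(tempfile.mkdtemp(), "trials_checkpoint.p")

trials = {}
for i in range(1, 26):
    trials[f"trial_{i}"] = ["paragraph text"]  # stand-in for one scraped page
    if i % 10 == 0:                            # checkpoint every 10 pages
        save_checkpoint(trials, checkpoint_path)

# if the scrape had crashed after page 25, this is what we could recover
with open(checkpoint_path, "rb") as f:
    recovered = pickle.load(f)
print(len(recovered), "trials recovered from the last checkpoint")
```

Only unpickle files you created yourself: loading a pickle runs whatever code it contains.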

In [None]:
# zip each of our lists into a dictionary, in which the key is the title and the value is the text of the trial.
obo_return_transit_trials = dict(zip(title_of_trials,text_of_trials))

In [None]:
len(obo_return_transit_trials)

In [None]:
import pickle

# wb stands for "write binary"; pickle stores the object as bytes rather than text
with open("obo_return_transit_trials.p","wb") as f:
    pickle.dump(obo_return_transit_trials, f)

### Save as JSON

* Essentially a Python dictionary saved as a plain-text file that other programs and languages can also read.

In [None]:
import json

dict_return_transit_trials = dict(zip(title_of_trials,text_of_trials))
with open('obo_transit_trails.json','w') as outfile: #open local file, write to it
    json.dump(dict_return_transit_trials, outfile) #json.dump writes to the file; json.dumps only returns a string
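
A quick way to convince yourself that a JSON file holds what you think it does is to read it straight back in. The sketch below does a full round trip with a tiny made-up dictionary and a temporary file; the names `trials` and `trials_example.json` are placeholders.

```python
import json
import os
import tempfile

# a tiny made-up dictionary in the same shape as ours: title -> list of paragraphs
trials = {"Trial A, 1794": ["First paragraph.", "Second paragraph."]}

path = os.path.join(tempfile.mkdtemp(), "trials_example.json")

# json.dump writes into an open file; indent=2 makes the file easier to read
with open(path, "w") as outfile:
    json.dump(trials, outfile, indent=2)

# json.load reads the file back into Python dictionaries and lists
with open(path) as infile:
    reloaded = json.load(infile)

print(reloaded == trials)
```

JSON only knows strings, numbers, booleans, null, lists, and dictionaries with string keys, which is why a dictionary of title strings to lists of paragraph strings survives the trip unchanged.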

***
### Data wrangling in two weeks. For now, we're just focused on pulling the data down from online.


### How to scrape a website that requires a login:

http://kazuar.github.io/scraping-tutorial/


- Same as above, but you will have a dictionary containing the login information that will be fed into your request command.

***
***

<font color=darkred>

# Homework

http://www.presidency.ucsb.edu/sou.php

### Scrape all of SOTU speeches. Store them as dictionary objects with each key as "lastname_firstname_date" and the value as the text (raw) of each speech. Upload these data to Harvard GDrive (n.b. $\infty$ space), Dropbox, or Github and post a link to that on Canvas. Also include your IPython notebook.

(n.b. In your script, your files should be saving to a place where they are automatically uploaded. This is good practice for getting large datasets without overloading your local storage. The `os.chdir` command changes the working directory to another folder, which you can set up to upload automatically.)