## String Games: Getting Started with Text Extraction in Python

![A design in Cat's Cradle known as Ruapehu and Tongariro.](https://i.pinimg.com/736x/1e/2d/1a/1e2d1a97d166f62ec5b4579bf5128765.jpg)

### Outline

* Note before we begin (5 mins)
* Two typical scenarios (Get clean text for analysis, Get tabular data for analysis)
    - Scenario 1: Extract plain text from HTML with BeautifulSoup (20 mins)
    - Scenario 2: Extract POS, named entities, and (rule-based) text patterns with spaCy (20 mins)
* Your own scenarios (30 mins)
* Questions / wrap-up (10 mins)

### Note before we begin

This type of workshop is often taught using interactive online notebooks. These make it easy to run small pieces of code in isolation, building step-by-step towards some goal. They allow everyone to work in the same environment and get the same results (we hope!). They also allow text, headings, links, etc. to be interspersed with the code.

However, you can also use a text editor program (such as Notepad++ for Windows, or TextWrangler for Mac), or various IDEs (integrated development environment programs). There are lots of ways to set up your computer to do Python programming, and the Programming Historian website has [a good guide](https://programminghistorian.org/en/lessons/introduction-and-installation) for this for Windows, Mac and Linux.

If you want to use Jupyter notebooks (as we are here), you can set this up, along with many of the Python libraries you might want, by installing the [Anaconda Community distribution](https://www.anaconda.com/distribution/).

This notebook is running through My Binder, a service that provides a server and programming environment to interact with the notebook online.

##### Running code cells

To run a cell, you need to click on it to select it, then press the 'Run' button on the menu bar, or press Shift+Enter (the latter is much more convenient, once you get used to it).

### Scenario 1: Extract plain text from HTML with BeautifulSoup

In [3]:
# These lines starting with # are comments - they don't 'do' anything
# Import 'requests', a library to help retrieve web pages
import requests

In [7]:
# We store a URL, giving it the variable name 'url'
# This is our first example of a 'string variable' - the type of data we are mainly concerned with here.
url = 'https://ucdh.github.io/scraping-garden-party/at-the-bay.html'

# The requests library will retrieve various information about the page
# By convention, we use 'r' to denote the object storing this information
r = requests.get(url)

# Print the status code and first 300 characters of the webpage text
print(r.status_code)
print(r.text[:300])

200
<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

<!-- Begin Jekyll SEO tag v2.5.0 -->
<title>At the Bay | Scraping Garden Party</title>
<meta nam
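
The `200` printed first is the HTTP status code, which tells us the request succeeded. As a small aside, Python's standard library can translate such codes into their standard phrases. The sketch below uses only `http.HTTPStatus`; the three codes and the `is_success` helper are chosen here for illustration and are not part of the workshop code.

```python
from http import HTTPStatus

# A few status codes you are likely to meet when requesting web pages
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Return True for codes in the 200-299 range, which signal success."""
    return 200 <= code < 300


print(is_success(200))
print(is_success(404))
```

If a request fails, `r.text` will usually hold an error page rather than the story, so it is worth checking `r.status_code` before going further.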


#### Task 1
Go to [Geoff's mockup page about Katherine Mansfield's The Garden Party](https://ucdh.github.io/scraping-garden-party/at-the-bay.html) and choose a new story to download. In the cell below, do the following:

* copy and modify the URL to be requested in order to download a different story of your choice.
* use Shift+Enter to run the cell
* modify the code so it prints the first 400 characters of the webpage text

In [8]:
# Your code here


In [5]:
# BeautifulSoup can parse web page structure, much like a web browser does
from bs4 import BeautifulSoup

In [6]:
url = 'https://ucdh.github.io/scraping-garden-party/at-the-bay.html'

r = requests.get(url)

# We create a BeautifulSoup object, using a parser library called 'lxml'
# lxml isn't imported directly, but is a dependency (or requirement) here
soup = BeautifulSoup(r.text, 'lxml')

In [8]:
# Select an element of the web page for display
soup.title

<title>At the Bay | Scraping Garden Party</title>

Like ```r```, ```soup``` is an object. Objects _bundle together_ many variables (properties of the object) as well as methods (actions we can perform on the object).

```soup.title``` gave us the page title. We can access other webpage properties you're probably familiar with, too. 
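
To make 'properties' and 'methods' a little more concrete, here is a minimal sketch using a made-up `Story` class. The class, its names and the placeholder text are hypothetical and have nothing to do with how requests or BeautifulSoup are written internally; they only illustrate the idea.

```python
class Story:
    """A toy object bundling data (properties) with actions (methods)."""

    def __init__(self, title, text):
        # Properties: values stored on the object
        self.title = title
        self.text = text

    def word_count(self):
        # Method: an action that uses the object's own data
        return len(self.text.split())


# Placeholder text, typed in by hand for this example
story = Story('At the Bay', 'Very early morning. The sun was not yet risen.')

print(story.title)         # a property: no brackets needed
print(story.word_count())  # a method: called with brackets
```

In the same spirit, `r.status_code` and `r.text` are properties of the `r` object, accessed without brackets.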

#### Task 2

Change the last line in the cell above to display the following page elements (or 'tags', for short) stored in the ```soup``` object:

- img
- h1
- p
- a

So far we can get webpage elements / tags, but only the first example of any given tag. Learn more about BeautifulSoup [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), such as the ```find_all()``` method that helps us extract all instances of a given element. 
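
As a rough illustration of the difference between asking for a tag by name and using ```find_all()```, the sketch below parses a tiny hand-written HTML snippet rather than a downloaded page. The snippet and the name `toy_soup` are made up for this example, and it uses Python's built-in `'html.parser'` instead of `'lxml'` so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, used here instead of a real download
html = ('<ul><li>At the Bay</li><li>The Garden Party</li>'
        '<li>The Daughters of the Late Colonel</li></ul>')

# 'html.parser' ships with Python, so no extra parser is needed for this sketch
toy_soup = BeautifulSoup(html, 'html.parser')

first_item = toy_soup.li             # only the first <li> tag
all_items = toy_soup.find_all('li')  # a list of every <li> tag

print(first_item)
print(len(all_items))

# Each item in the list is itself a tag, so we can loop over them
for item in all_items:
    print(item.get_text())
```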

#### Task 3
Collect all the links stored in the ```soup``` variable by replacing the ```%``` character in the code below.

In [None]:
links = soup.find_all('%')

for link in links:
    print(link)

Finally, BeautifulSoup has a nice helper method to extract the clean text.

In [None]:
clean_text = soup.find(id='page-content').get_text() # Try removing the '.get_text()' bit to see the difference.
clean_text
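
By default, `get_text()` simply joins the text of neighbouring tags, which can run words together. The sketch below, on a made-up HTML snippet, shows one possible way to tidy this using the method's `separator` and `strip` arguments. The snippet and variable names are illustrative only, and whether you need this depends on the page you are scraping.

```python
from bs4 import BeautifulSoup

# A made-up snippet that mimics the structure of the story page
html = '<div id="page-content"><h1>At the Bay</h1><p>Very early morning.</p></div>'
toy_soup = BeautifulSoup(html, 'html.parser')
content = toy_soup.find(id='page-content')

# Default: text from neighbouring tags is joined with nothing in between
plain_text = content.get_text()
print(plain_text)

# separator puts a space between pieces; strip trims whitespace from each piece
tidy_text = content.get_text(separator=' ', strip=True)
print(tidy_text)
```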

You can remove the output from a selected cell by going to Cell > Current Outputs > Clear in the menu bar.

### Scenario 2: Extract parts of speech, named entities and text patterns with spaCy

In [24]:
import spacy

In [25]:
# This step assumes a spaCy language model (e.g. 'en_core_web_sm') has been installed,
# for example with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

#### Parts of Speech

In [26]:
# Sample text from 'At the Bay'. You can substitute other examples here
# Just make sure you retain the triple quotes, which let a string run over several lines
sample_text = '''And now they had passed the fisherman’s hut, passed the charred-looking little whare where Leila the milk-girl lived with her old Gran.
The sheep strayed over a yellow swamp and Wag, the sheep-dog, padded after, rounded them up and headed them for the steeper, narrower rocky pass that led out of Crescent Bay and towards Daylight Cove.
'''

doc = nlp(sample_text)

In [28]:
# display each word (aka 'token') and its part of speech
for token in doc:
    print(token.pos_, token.text)

CCONJ And
ADV now
PRON they
VERB had
VERB passed
DET the
NOUN fisherman
PROPN ’s
NOUN hut
PUNCT ,
VERB passed
DET the
VERB charred
PUNCT -
VERB looking
ADJ little
NOUN whare
ADV where
PROPN Leila
DET the
NOUN milk
PUNCT -
NOUN girl
VERB lived
ADP with
DET her
ADJ old
PROPN Gran
PUNCT .
SPACE 

DET The
NOUN sheep
VERB strayed
ADP over
DET a
ADJ yellow
NOUN swamp
CCONJ and
PROPN Wag
PUNCT ,
DET the
NOUN sheep
PUNCT -
NOUN dog
PUNCT ,
VERB padded
ADP after
PUNCT ,
VERB rounded
PRON them
PART up
CCONJ and
VERB headed
PRON them
ADP for
DET the
NOUN steeper
PUNCT ,
ADJ narrower
ADJ rocky
NOUN pass
DET that
VERB led
ADP out
ADP of
PROPN Crescent
PROPN Bay
CCONJ and
ADP towards
PROPN Daylight
PROPN Cove
PUNCT .
SPACE 



In [30]:
# displacy, spaCy's built-in visualiser, draws the dependency tree, giving us a fancier view
from spacy import displacy
displacy.render(doc, style="dep", options={'compact': True, 'distance': 100})

PS Did you spot any errors? These statistical models are good, but not perfect.

#### Task 4

Choose another sentence or two containing some interesting words or names (or 'plausible distractors') to test on spaCy's model. Paste these into the variable ```sample_text``` above and re-run it, then re-run the two code cells that follow. 

#### Extracting parts of speech from longer texts

Parts of speech are very useful for identifying linguistic features of interest. For example, we might extract adjectives from a whole story with the following code (and this could be scaled up to all stories in a collection, and beyond). Such techniques can be helpful for literary, discourse or content analysis.

In [29]:
# We'll make a new variable 'doc2', so as not to get confused!
# Remember, 'clean_text' is the variable output from BeautifulSoup.
doc2 = nlp(clean_text)

In [30]:
# We'll import an object to help us count words
from collections import Counter

# The next line is a loop, but reformulated into one line
# The technique is called a 'list comprehension'
adjectives = [token.text for token in doc2 if token.pos_ == 'ADJ']
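
If list comprehensions are new to you, it may help to see the same filter written both ways. This sketch uses a short hand-typed list of (word, part-of-speech) pairs as a stand-in for spaCy tokens, so it runs without a language model; the variable names are made up for the example.

```python
# Stand-in data: (word, part-of-speech) pairs typed by hand for illustration
tagged_words = [('the', 'DET'), ('little', 'ADJ'), ('whare', 'NOUN'),
                ('her', 'DET'), ('old', 'ADJ'), ('Gran', 'PROPN')]

# Version 1: an ordinary loop that builds up a list step by step
adjectives_loop = []
for word, pos in tagged_words:
    if pos == 'ADJ':
        adjectives_loop.append(word)

# Version 2: the same logic as a list comprehension
adjectives_comp = [word for word, pos in tagged_words if pos == 'ADJ']

print(adjectives_loop)
print(adjectives_comp)
```

Both versions produce the same list; the comprehension is just more compact.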

In [58]:
freq = Counter(adjectives)

In [32]:
freq.most_common(15)

[('little', 66),
 ('old', 27),
 ('other', 20),
 ('blue', 19),
 ('small', 17),
 ('big', 16),
 ('white', 14),
 ('long', 13),
 ('black', 13),
 ('same', 12),
 ('bright', 12),
 ('red', 11),
 ('silly', 11),
 ('own', 11),
 ('good', 11)]
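
One thing to watch for: `Counter` treats strings that differ only in capitalisation as different words, so an adjective at the start of a sentence would be counted separately. The sketch below, on a made-up word list, shows one possible fix, which is to lower-case each word before counting. With spaCy tokens you could use `token.lower_` in place of `token.text` for a similar effect.

```python
from collections import Counter

# Made-up list of adjectives, as they might come out of a story
words = ['little', 'Little', 'old', 'little', 'Old', 'blue']

# Counting the raw strings treats 'Little' and 'little' as different words
raw_freq = Counter(words)

# Lower-casing first merges them into a single entry
folded_freq = Counter(word.lower() for word in words)

print(raw_freq['little'])
print(folded_freq['little'])
print(folded_freq.most_common(2))
```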

There are other Python libraries you should consider for this kind of linguistic analysis, notably the Natural Language Toolkit (NLTK). For matching patterns, try spaCy's rule-based pattern matcher, and, for more fine-grained control, learn how to use regular expressions.
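
As a small taste of regular expressions, here is a sketch that pulls hyphenated compounds out of an abridged version of the sample text, using Python's built-in `re` module. The pattern is just one reasonable way to describe 'words joined by hyphens' and is offered as a starting point rather than a definitive rule.

```python
import re

# Abridged from the 'At the Bay' sample used earlier
sample = ('passed the charred-looking little whare where Leila the milk-girl '
          'lived with her old Gran. Wag, the sheep-dog, padded after')

# One or more word characters, followed by at least one "-word" group
hyphenated = re.findall(r'\b\w+(?:-\w+)+\b', sample)

print(hyphenated)
```

This finds `charred-looking`, `milk-girl` and `sheep-dog`, which spaCy's tokeniser split into separate tokens in the output above.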

#### Task 5

Select a different story by changing the ```url``` variable way back in Scenario 1, and process it with BeautifulSoup so that it is stored in the variable ```clean_text```. Then re-run the previous four cells to see which adjectives are frequent in that story.

#### Extracting named entities

In [33]:
# Let's go back to our first example - "At The Bay" by Katherine Mansfield

sample_text = '''And now they had passed the fisherman’s hut, passed the charred-looking little whare where Leila the milk-girl lived with her old Gran.
The sheep strayed over a yellow swamp and Wag, the sheep-dog, padded after, rounded them up and headed them for the steeper, narrower rocky pass that led out of Crescent Bay and towards Daylight Cove.
'''

doc3 = nlp(sample_text)

In [37]:
for ent in doc3.ents:
    print(ent.text, ent.label_)

’s ORG
Leila ORG
Gran PERSON
Wag PERSON
Crescent Bay LOC
Daylight Cove PERSON


Not very impressive - only 50% right if we count Wag the dog as a person (and we will!).

But for many types of text spaCy will do better than this.
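
To show where the 50% figure for the Mansfield sample comes from, the sketch below compares spaCy's predictions (copied from the output above) with a set of hand-assigned labels. Those 'expected' labels are an assumption made for this example, reflecting one reader's judgement, with Wag counted as a person.

```python
# spaCy's predictions, copied from the output above
predicted = [('’s', 'ORG'), ('Leila', 'ORG'), ('Gran', 'PERSON'),
             ('Wag', 'PERSON'), ('Crescent Bay', 'LOC'), ('Daylight Cove', 'PERSON')]

# Hand-assigned labels: an assumption made for this sketch.
# None means "should not have been tagged as an entity at all".
expected = {'’s': None, 'Leila': 'PERSON', 'Gran': 'PERSON',
            'Wag': 'PERSON', 'Crescent Bay': 'LOC', 'Daylight Cove': 'LOC'}

# Keep only the predictions whose label matches our own judgement
correct = [text for text, label in predicted if expected[text] == label]
accuracy = len(correct) / len(predicted)

print(correct)
print(f'{accuracy:.0%}')
```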

In [55]:
# source: https://en.wikipedia.org/wiki/Katherine_Mansfield

media_text = '''Kathleen Mansfield Murry (née Beauchamp; 14 October 1888 – 9 January 1923) was a prominent New Zealand modernist short story writer and poet who was born and brought up in colonial New Zealand and wrote under the pen name of Katherine Mansfield. 
At the age of 19, she left New Zealand and settled in England, where she became a friend of writers such as D. H. Lawrence and Virginia Woolf. 
Mansfield was diagnosed with extrapulmonary tuberculosis in 1917; the disease claimed her life at the age of 34.
'''

In [53]:
doc4 = nlp(media_text)

In [54]:
for ent in doc4.ents:
    print(ent.text, ent.label_)

Kathleen Mansfield Murry PERSON
Beauchamp GPE
14 October 1888 DATE
New Zealand GPE
New Zealand GPE
Katherine Mansfield PERSON
the age of 19 DATE
New Zealand GPE
England GPE
D. H. Lawrence PERSON
Virginia Woolf PERSON
Mansfield PERSON
1917 DATE
the age of 34 DATE
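
Once entities have been extracted, a common next step is to group them by label. The sketch below does this for a few (text, label) pairs copied by hand from the output above, so it runs without spaCy. The variable names are made up, and in the notebook you could build the same kind of list from `doc4.ents` using `ent.text` and `ent.label_`.

```python
from collections import defaultdict

# (text, label) pairs copied from part of the output above
entities = [('Kathleen Mansfield Murry', 'PERSON'), ('14 October 1888', 'DATE'),
            ('New Zealand', 'GPE'), ('New Zealand', 'GPE'), ('England', 'GPE'),
            ('Virginia Woolf', 'PERSON'), ('1917', 'DATE')]

# A dictionary that creates an empty list for any label it has not seen yet
by_label = defaultdict(list)
for text, label in entities:
    # Skip repeats so each entity is listed once per label
    if text not in by_label[label]:
        by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```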


### Structuring your data

In [None]:
# Todo

### Saving your data

#### Text files

#### CSV files

In [None]:
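from csv import DictWriter

# 'all_data' is assumed to be a list of dictionaries, one per row, with keys matching 'fields'
fields = ['title', 'funding_amount', 'funding_type', 'funding_subtype', 'production_company', 'genre', 'episodes_duration', 'primary_platform', 'date', 'synopsis']
# Mode 'a' appends rows to the file; newline='' avoids blank rows on Windows; '\t' makes the file tab-separated
with open('datafile.tsv', 'a', newline='', encoding='utf-8') as f:
    writer = DictWriter(f, fieldnames=fields, delimiter='\t')
    writer.writerows(all_data)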
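
For a version you can run straight away, here is a minimal sketch that writes two rows to a CSV file with a header and reads them back. The rows reuse two of the adjective counts shown earlier; the file name, the column names and the use of a temporary folder are choices made for this example only.

```python
import csv
import os
import tempfile

# Two rows based on the adjective counts shown earlier
rows = [{'word': 'little', 'count': 66}, {'word': 'old', 'count': 27}]

# Write into a temporary folder so the example leaves your own files alone
path = os.path.join(tempfile.mkdtemp(), 'adjectives.csv')

with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['word', 'count'])
    writer.writeheader()    # first line of the file: word,count
    writer.writerows(rows)

# Read the file back to check what was saved
with open(path, newline='', encoding='utf-8') as f:
    rows_back = list(csv.DictReader(f))

print(rows_back)
```

Note that values come back from a CSV file as strings, so the counts would need converting with `int()` before doing any arithmetic on them.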