## String Games: Getting Started with Text Wrangling in Python

![A design in Cat's Cradle known as Ruapehu and Tongariro.](https://i.pinimg.com/736x/1e/2d/1a/1e2d1a97d166f62ec5b4579bf5128765.jpg)

### Outline

* Note before we begin (5 mins)
* Two typical scenarios (Get clean text for analysis, extract information for analysis)
    - Scenario 1: Extracting text from HTML with BeautifulSoup (20 mins)
    - Scenario 2: Extract POS and named entities with spaCy (20 mins)
* Your own scenarios (30 mins) (if time)
* Questions / wrap-up (5-10 mins)

### Note before we begin

This type of workshop is often taught using interactive online notebooks. These make it easy to run small pieces of code and better understand each step in the program. They allow everyone to work in the same environment and get the same results (we hope!), and also allow text, headings, links, etc. to be interspersed with the code.

You can also write Python programs with a text editor (such as Notepad++ for Windows, or TextWrangler for Mac), or various IDEs (integrated development environment programs). There are lots of ways to set up your computer to do Python programming, and the Programming Historian website has [a good guide](https://programminghistorian.org/en/lessons/introduction-and-installation) for Windows, Mac and Linux.

If you want to use Jupyter notebooks (as we are here), I recommend you set this up, along with many of the Python libraries you might want, by installing the [Anaconda Community distribution](https://www.anaconda.com/distribution/).

For the purposes of the workshop, this notebook is running through My Binder, a service that provides a server and programming environment to interact with the notebook online.

##### Running code cells
The notebook is made up of cells, which can be run individually or together. You must run them in order as the programs are sequential - later parts rely on the completion of earlier parts.

To run a cell, you need to click on it to select it, then press the 'Run' button on the menu bar, or press Shift+Enter (the latter is much more convenient, once you get used to it). Working cell by cell helps break the program into pieces which can be understood more easily.

### Scenario 1: Extract plain text from HTML with BeautifulSoup

In [21]:
# These lines starting with # are comments - they don't 'do' anything
# Import 'requests', a library to help retrieve web pages
import requests

In [22]:
# We store a URL, giving it the variable name 'url'
# This is our first example of a 'string variable' - the type of data we are mainly concerned with here.
url = 'https://ucdh.github.io/scraping-garden-party/at-the-bay.html'

# The requests library will retrieve various information about the page
# By convention, we use 'r' to denote the object storing this information
r = requests.get(url)

# Print the status code and first 300 characters of the webpage text
print(r.status_code)
print(r.text[:300])

200
<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">

<!-- Begin Jekyll SEO tag v2.5.0 -->
<title>At the Bay | Scraping Garden Party</title>
<meta nam
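
`r.text` is an ordinary Python string, so `r.text[:300]` is a 'slice': it keeps the characters from the start up to, but not including, position 300. Here is a minimal sketch of the same idea on a short made-up string (`page_text` is an illustrative stand-in, not something downloaded from the web):

```python
# A short stand-in for r.text (hypothetical value, not downloaded)
page_text = '<!DOCTYPE html><html><head><title>At the Bay</title></head></html>'

first_15 = page_text[:15]    # characters at positions 0 to 14
last_7 = page_text[-7:]      # the final 7 characters
middle = page_text[34:44]    # characters at positions 34 to 43

print(first_15)
print(last_7)
print(middle)
```

Changing the number inside the square brackets is all that is needed to show more or less of the page.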


#### Task 1
Go to [Geoff's mockup page about Katherine Mansfield's The Garden Party](https://ucdh.github.io/scraping-garden-party/at-the-bay.html) and choose a new story to download. In the cell below, do the following:

* copy and modify the URL to be requested in order to download a different story of your choice.
* use Shift+Enter to run the cell
* modify the code so it prints the first 400 characters of the webpage text

In [23]:
# Your code here


In [24]:
# BeautifulSoup can parse web page structure, much like a web browser does
from bs4 import BeautifulSoup

In [25]:
url = 'https://ucdh.github.io/scraping-garden-party/at-the-bay.html'

r = requests.get(url)

# We create a BeautifulSoup object, using a parser library called 'lxml'
# lxml isn't imported directly, but is a dependency (or requirement) here
soup = BeautifulSoup(r.text, 'lxml')

In [26]:
# Select an element of the web page for display
soup.title

<title>At the Bay | Scraping Garden Party</title>

Like ```r```, ```soup``` is an object. Objects _bundle together_ variables (the properties of the object) and methods (actions we can perform on the object).

```soup.title``` gave us the page title. We can access other webpage properties you're probably familiar with, too. 
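
The same split between properties and methods shows up all over Python. As a small sketch, here is an object from the standard library's `urllib.parse` module (not otherwise used in this workshop) that describes the parts of our URL:

```python
from urllib.parse import urlparse

url = 'https://ucdh.github.io/scraping-garden-party/at-the-bay.html'

# urlparse returns an object describing the parts of the URL
parts = urlparse(url)

# Properties: values stored on the object (no brackets needed)
print(parts.scheme)
print(parts.netloc)
print(parts.path)

# Methods: actions the object can perform (called with brackets)
print(parts.geturl())
```

Note the brackets: `parts.path` looks up a stored value, while `parts.geturl()` asks the object to do something.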

#### Task 2

Change the last line in the cell above to display the following page elements (or 'tags', for short) stored in the ```soup``` object:

- img
- h1
- p
- a

So far we can get webpage elements / tags, but only the first example of any given tag. Learn more about BeautifulSoup [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), such as the ```find_all()``` method that helps us extract all instances of a given element. 
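
To see the difference between asking for a tag directly and using `find_all()`, here is a minimal sketch on a tiny made-up page. The HTML and the name `mini_soup` are illustrative only, and the sketch uses Python's built-in `'html.parser'` rather than `lxml` so that it needs nothing extra installed:

```python
from bs4 import BeautifulSoup

# A tiny made-up page (hypothetical content, for illustration only)
html = '''
<ul>
  <li>At the Bay</li>
  <li>The Garden Party</li>
  <li>Miss Brill</li>
</ul>
'''

# 'html.parser' ships with Python, so no extra parser is needed here
mini_soup = BeautifulSoup(html, 'html.parser')

first_item = mini_soup.li              # only the first <li>
all_items = mini_soup.find_all('li')   # every <li>, in page order

print(first_item.get_text())
for item in all_items:
    print(item.get_text())
```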

#### Task 3
Collect all the links stored in the ```soup``` variable by replacing the ```%``` character in the code below.

In [27]:
links = soup.find_all('%')

for link in links:
    print(link)

Finally, BeautifulSoup has a nice helper method to extract the clean text.

In [28]:
clean_text = soup.find(id='page-content').get_text() # Try removing the '.get_text()' bit to see the difference.
clean_text

'\nAt the Bay\nI\nVery early morning. The sun was not yet risen, and the whole of\nCrescent Bay was hidden under a white sea-mist. The big bush-covered\nhills at the back were smothered. You could not see where they ended\nand the paddocks and bungalows began. The sandy road was gone and the\npaddocks and bungalows the other side of it; there were no white dunes\ncovered with reddish grass beyond them; there was nothing to mark which\nwas beach and where was the sea. A heavy dew had fallen. The grass was\nblue. Big drops hung on the bushes and just did not fall; the silvery,\nfluffy toi-toi was limp on its long stalks, and all the marigolds and\nthe pinks in the bungalow gardens were bowed to the earth with wetness.\nDrenched were the cold fuchsias, round pearls of dew lay on the flat\nnasturtium leaves. It looked as though the sea had beaten up softly in\nthe darkness, as though one immense wave had come rippling,\nrippling—how far? Perhaps if you had waked up in the middle of the\nni

You can remove the output from a selected cell by going to Cell > Current Outputs > Clear in the menu bar.
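
The text returned by `get_text()` still contains the line breaks from the page, which show up as `\n` in the output above. If single-line text suits your purposes better, one option is to split on whitespace and rejoin with single spaces. A minimal sketch on a short excerpt (`excerpt` is a stand-in for `clean_text`):

```python
# A short excerpt standing in for clean_text
excerpt = ('\nAt the Bay\nI\nVery early morning. The sun was not yet risen, '
           'and the whole of\nCrescent Bay was hidden under a white sea-mist.')

# split() with no arguments breaks on any run of whitespace,
# including newlines, and ignores whitespace at either end
one_line = ' '.join(excerpt.split())

print(one_line)
```

This also removes paragraph breaks, so it is only worth doing if you don't need them later.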

#### Write the story to a file

It's pretty easy to write the text to a file for later use. You could collect all the stories by looping over all the story page links.

In [29]:
with open('story.txt', 'w', encoding='utf-8') as f:
    f.write(clean_text)
    
# this can be read as: open (or create) the file 'story.txt' in writing mode
# assign the open file object to the variable f
# write the contents of the 'clean_text' variable into f
# (implied by the with statement) close the file
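
Reading the text back in later follows the same pattern, with `'r'` (reading mode) in place of `'w'`. The sketch below writes a short made-up string to a temporary folder and reads it back; the temporary folder and the variable names are only there to keep the example self-contained.

```python
import os
import tempfile

story = 'Very early morning. The sun was not yet risen.'

# Write into a temporary folder so this sketch leaves no files behind
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'story.txt')

    with open(path, 'w', encoding='utf-8') as f:
        f.write(story)

    # 'r' opens the file in reading mode; read() returns its full contents
    with open(path, 'r', encoding='utf-8') as f:
        text_from_file = f.read()

print(text_from_file == story)
```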

### Scenario 2: Extract parts of speech, named entities and text patterns with spaCy

spaCy is a library for 'natural language processing' - extracting linguistic features and other useful information.

In [30]:
import spacy

In [31]:
# This step assumes a spaCy language model (eg 'en_core_web_sm') has been installed
# See https://spacy.io/usage/models#quickstart for details
nlp = spacy.load('en_core_web_sm')

#### Parts of Speech

In [32]:
# Sample text from 'At the Bay'. You can substitute other examples here
# Just make sure you retain the triple quotes, which let a string run over more than one line
sample_text = '''And now they had passed the fisherman’s hut, passed the charred-looking little whare where Leila the milk-girl lived with her old Gran.
The sheep strayed over a yellow swamp and Wag, the sheep-dog, padded after, rounded them up and headed them for the steeper, narrower rocky pass that led out of Crescent Bay and towards Daylight Cove.
'''

doc = nlp(sample_text)

In [33]:
# display each word (aka 'token') and its part of speech
for token in doc:
    print(token.pos_, token.text)

CCONJ And
ADV now
PRON they
VERB had
VERB passed
DET the
NOUN fisherman
PROPN ’s
NOUN hut
PUNCT ,
VERB passed
DET the
VERB charred
PUNCT -
VERB looking
ADJ little
NOUN whare
ADV where
PROPN Leila
DET the
NOUN milk
PUNCT -
NOUN girl
VERB lived
ADP with
DET her
ADJ old
PROPN Gran
PUNCT .
SPACE 

DET The
NOUN sheep
VERB strayed
ADP over
DET a
ADJ yellow
NOUN swamp
CCONJ and
PROPN Wag
PUNCT ,
DET the
NOUN sheep
PUNCT -
NOUN dog
PUNCT ,
VERB padded
ADP after
PUNCT ,
VERB rounded
PRON them
PART up
CCONJ and
VERB headed
PRON them
ADP for
DET the
NOUN steeper
PUNCT ,
ADJ narrower
ADJ rocky
NOUN pass
DET that
VERB led
ADP out
ADP of
PROPN Crescent
PROPN Bay
CCONJ and
ADP towards
PROPN Daylight
PROPN Cove
PUNCT .
SPACE 



In [34]:
# the displacy library visualises the dependency tree, giving us a fancier view
from spacy import displacy
displacy.render(doc, style="dep", options={'compact': True, 'distance': 100})

PS Did you spot any errors? In the list above, for example, '’s' is tagged as a proper noun (PROPN) and 'steeper' as a noun. These statistical models are good, but not perfect.

#### Task 3

Choose another sentence or two containing some interesting words or names (or 'plausible distractors') to test on spaCy's model. Paste these into the variable ```sample_text``` above and re-run it, then re-run the two code cells that follow. 

#### Extracting parts of speech from longer texts

Parts of speech are very useful for identifying linguistic features of interest. For example, we might extract adjectives from a whole story with the following code (and this could be scaled up to all stories in a collection, and beyond). Such techniques can be helpful for literary, discourse or content analysis.

In [35]:
# We'll make a new variable 'doc2', so as not to get confused!
# Remember, 'clean_text' is the variable output from BeautifulSoup.
doc2 = nlp(clean_text)

In [36]:
from collections import Counter

# here we are using a loop to collect adjectives
# it's just a more compact syntax, called a 'list comprehension'
adjectives = [token.text for token in doc2 if token.pos_ == 'ADJ']
freq = Counter(adjectives)

In [37]:
freq.most_common(20)

[('little', 66),
 ('old', 27),
 ('other', 20),
 ('blue', 19),
 ('small', 17),
 ('big', 16),
 ('white', 14),
 ('long', 13),
 ('black', 13),
 ('same', 12),
 ('bright', 12),
 ('red', 11),
 ('silly', 11),
 ('own', 11),
 ('good', 11),
 ('whole', 10),
 ('cold', 9),
 ('first', 9),
 ('open', 9),
 ('yellow', 8)]
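
If list comprehensions are new to you, it may help to see one next to the loop it replaces. This sketch uses made-up `(word, part of speech)` pairs in place of spaCy tokens, so it runs without a language model:

```python
from collections import Counter

# (word, part of speech) pairs standing in for spaCy tokens (made-up sample)
tagged = [('little', 'ADJ'), ('whare', 'NOUN'), ('old', 'ADJ'),
          ('Gran', 'PROPN'), ('little', 'ADJ'), ('yellow', 'ADJ')]

# Version 1: an explicit loop that builds the list step by step
adjectives_loop = []
for word, pos in tagged:
    if pos == 'ADJ':
        adjectives_loop.append(word)

# Version 2: the same logic as a list comprehension
adjectives_comp = [word for word, pos in tagged if pos == 'ADJ']

print(adjectives_loop == adjectives_comp)
print(Counter(adjectives_comp).most_common(1))
```

Both versions produce the same list; the comprehension just says it in one line.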

There are other Python libraries you should consider for this kind of linguistic analysis, notably the Natural Language Tool Kit (NLTK). For matching patterns, try spaCy's rule-based pattern matcher, and, for more fine-grained control, learn how to use regular expressions.
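
As a small taste of regular expressions, here is a sketch that pulls hyphenated words out of our sample sentence using Python's built-in `re` module. The pattern is one possible way to describe 'a word, then one or more hyphen-plus-word groups', not the only one.

```python
import re

sentence = ('And now they had passed the fisherman’s hut, passed the '
            'charred-looking little whare where Leila the milk-girl '
            'lived with her old Gran.')

# \w+ matches a run of letters or digits; (?:-\w+)+ requires at least
# one hyphen followed by another run; \b marks a word boundary
pattern = r'\b\w+(?:-\w+)+\b'

hyphenated = re.findall(pattern, sentence)
print(hyphenated)
```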

#### Task 4

Select a different story by changing the ```url``` variable back in Scenario 1, and process it with BeautifulSoup so that it is stored in the variable ```clean_text```. Then re-run the previous four cells to see which adjectives are frequent in that story.

#### Extracting named entities

In [38]:
# Let's go back to our first example - "At The Bay" by Katherine Mansfield

sample_text = '''And now they had passed the fisherman’s hut, passed the charred-looking little whare where Leila the milk-girl lived with her old Gran.
The sheep strayed over a yellow swamp and Wag, the sheep-dog, padded after, rounded them up and headed them for the steeper, narrower rocky pass that led out of Crescent Bay and towards Daylight Cove.
'''

doc3 = nlp(sample_text)

In [39]:
for ent in doc3.ents:
    print(ent.text, ent.label_)

’s ORG
Leila ORG
Gran PERSON
Wag PERSON
Crescent Bay LOC
Daylight Cove PERSON


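Here we will count Wag the dog as a person. The model has made some mistakes, though: it labels '’s' and 'Leila' as organisations and 'Daylight Cove' as a person.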

Fiction is a little more challenging for entity extraction, but for many types of text spaCy will do very well. Here is another example:

In [40]:
# source: https://en.wikipedia.org/wiki/Katherine_Mansfield

media_text = '''Kathleen Mansfield Murry (née Beauchamp; 14 October 1888 – 9 January 1923) was a prominent New Zealand modernist short story writer and poet who was born and brought up in colonial New Zealand and wrote under the pen name of Katherine Mansfield. 
At the age of 19, she left New Zealand and settled in England, where she became a friend of writers such as D. H. Lawrence and Virginia Woolf. 
Mansfield was diagnosed with extrapulmonary tuberculosis in 1917; the disease claimed her life at the age of 34.
'''

In [41]:
doc4 = nlp(media_text)

In [42]:
for ent in doc4.ents:
    print(ent.text, ent.label_)

Kathleen Mansfield Murry PERSON
Beauchamp GPE
14 October 1888 DATE
New Zealand GPE
New Zealand GPE
Katherine Mansfield PERSON
the age of 19 DATE
New Zealand GPE
England GPE
D. H. Lawrence PERSON
Virginia Woolf PERSON
Mansfield PERSON
1917 DATE
the age of 34 DATE
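
A flat list like this is often easier to use once it is grouped by label. The sketch below does that with a selection of `(text, label)` pairs copied from the output above, so it runs without spaCy; in the notebook, the same pairs could be built with `[(ent.text, ent.label_) for ent in doc4.ents]`.

```python
from collections import defaultdict

# A selection of (text, label) pairs copied from the output above
entities = [
    ('Kathleen Mansfield Murry', 'PERSON'),
    ('Beauchamp', 'GPE'),
    ('14 October 1888', 'DATE'),
    ('New Zealand', 'GPE'),
    ('New Zealand', 'GPE'),
    ('England', 'GPE'),
    ('D. H. Lawrence', 'PERSON'),
    ('Virginia Woolf', 'PERSON'),
    ('1917', 'DATE'),
]

# defaultdict(list) creates an empty list the first time a label is seen
by_label = defaultdict(list)
for text, label in entities:
    # skip repeats so each entity is listed once per label
    if text not in by_label[label]:
        by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```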


### Structuring your data

Let's get a bit more data. Say we want to count adjectives in every story in the collection; we could go about it like this.

In [43]:
# First, make a list of URLs to all the stories
contents_url = 'https://ucdh.github.io/scraping-garden-party/'
r = requests.get(contents_url)
soup = BeautifulSoup(r.text, 'lxml')
contents_list = soup.find_all('li')

# next we use an 'accumulator pattern' 
# make an empty list, then add the URLs to it one by one
story_urls = []

for item in contents_list:
    url_ending = item.a['href']
    story_urls.append('https://ucdh.github.io' + url_ending)

print(story_urls)

['https://ucdh.github.io/scraping-garden-party/at-the-bay.html', 'https://ucdh.github.io/scraping-garden-party/the-garden-party.html', 'https://ucdh.github.io/scraping-garden-party/daughters-late-colonel.html', 'https://ucdh.github.io/scraping-garden-party/mr-and-mrs-dove.html', 'https://ucdh.github.io/scraping-garden-party/the-young-girl.html', 'https://ucdh.github.io/scraping-garden-party/life-of-ma-parker.html', 'https://ucdh.github.io/scraping-garden-party/marriage-a-la-mode.html', 'https://ucdh.github.io/scraping-garden-party/the-voyage.html', 'https://ucdh.github.io/scraping-garden-party/miss-brill.html', 'https://ucdh.github.io/scraping-garden-party/her-first-ball.html', 'https://ucdh.github.io/scraping-garden-party/the-singing-lesson.html', 'https://ucdh.github.io/scraping-garden-party/the-stranger.html', 'https://ucdh.github.io/scraping-garden-party/bank-holiday.html', 'https://ucdh.github.io/scraping-garden-party/an-ideal-family.html', 'https://ucdh.github.io/scraping-garden-
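
Gluing strings together works here because every link on the contents page starts from the site root. If you are unsure what form the links on a page take, the standard library's `urljoin` is a more forgiving option. In this sketch the two `href` values are assumptions chosen to show both common styles:

```python
from urllib.parse import urljoin

contents_url = 'https://ucdh.github.io/scraping-garden-party/'

# Assumed href styles (illustrative values):
root_relative = '/scraping-garden-party/at-the-bay.html'   # starts from the site root
page_relative = 'miss-brill.html'                          # relative to the current page

full_url_1 = urljoin(contents_url, root_relative)
full_url_2 = urljoin(contents_url, page_relative)

print(full_url_1)
print(full_url_2)
```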

Now we'll go through, story by story, and collect data about each one.

In [44]:
all_data = []

for story_url in story_urls:
    # create a variable to hold data about individual stories
    story_data = {}
    
    r = requests.get(story_url)
    soup = BeautifulSoup(r.text, 'lxml')
    clean_text = soup.get_text()
    
    # collect the title, while tidying up a bit
    story_data['story_title'] = soup.title.contents[0].split('|')[0]
    
    doc5 = nlp(clean_text)
    
    # collect adjective counts for 20 most common
    adjectives = [token.text for token in doc5 if token.pos_ == 'ADJ']
    adj_count = Counter(adjectives).most_common(20)
    
    # add counts for each word to our story_data variable
    for word in adj_count:
        story_data[word[0]] = word[1]
    
    # append our story data variable to the 'all_data' list before moving on to the next story
    all_data.append(story_data)

Let's see what the data for one story looks like:

In [45]:
# run this cell
all_data[0]

{'story_title': 'At the Bay ',
 'little': 66,
 'old': 27,
 'other': 20,
 'blue': 19,
 'small': 18,
 'big': 16,
 'white': 14,
 'long': 13,
 'black': 13,
 'same': 12,
 'bright': 12,
 'red': 11,
 'silly': 11,
 'own': 11,
 'good': 11,
 'whole': 10,
 'cold': 9,
 'first': 9,
 'open': 9,
 'yellow': 8}
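
Notice that the title above has kept a trailing space: `'At the Bay '`. That is the space that sat before the `|` character in the page title. If it gets in the way, the string method `strip()` is one way to remove it, as this small sketch shows:

```python
page_title = 'At the Bay | Scraping Garden Party'

# split('|') leaves the space that sat before the '|' character
untidy = page_title.split('|')[0]

# strip() removes whitespace from both ends of a string
tidy = untidy.strip()

# repr() shows the quotes, which makes stray spaces visible
print(repr(untidy))
print(repr(tidy))
```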

To save this as a table, we also need to organise our data into columns, one per adjective. Many adjectives won't be among the 20 most common in every story, so in those stories the fields will be empty.

In [46]:
col_headings = []

for story in all_data:
    
    # get a list of adjectives for each story
    headings = list(story.keys())
    
    # for each adjective, add it to our column headings if it's not there already
    for heading in headings:
        if heading not in col_headings:
            col_headings.append(heading)
            
col_headings

['story_title',
 'little',
 'old',
 'other',
 'blue',
 'small',
 'big',
 'white',
 'long',
 'black',
 'same',
 'bright',
 'red',
 'silly',
 'own',
 'good',
 'whole',
 'cold',
 'first',
 'open',
 'yellow',
 'dark',
 'dear',
 'warm',
 'tall',
 'last',
 'right',
 'young',
 'poor',
 'green',
 'Good',
 'nice',
 'absurd',
 'such',
 'extravagant',
 'perfect',
 'pale',
 'weak',
 'more',
 'sure',
 'fond',
 'awful',
 'least',
 'strange',
 'soft',
 'dead',
 'short',
 'only',
 'quick',
 'immense',
 'serious',
 'Other',
 'bored',
 'sorry',
 'delighted',
 'dreadful',
 'desperate',
 'literary',
 'hard',
 'great',
 'flat',
 'fair',
 'worn',
 'new',
 'cool',
 'dull',
 'fat',
 'late',
 'precious',
 'loud',
 'round',
 'deep',
 'beautiful',
 'few',
 'high',
 'sad',
 'funny',
 'Little',
 'tiny',
 'fine',
 'faint',
 'Dear',
 'pink',
 'golden',
 'marvellous',
 'bad',
 'sharp',
 'surprised',
 'much',
 'Fast',
 'sweet',
 'silent',
 'ready',
 'ill',
 'grey',
 'alive',
 'thick',
 'tired',
 'ideal',
 'broad',
 'a
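
Looking down the list, 'Good' and 'good', 'Little' and 'little', and 'Dear' and 'dear' have each ended up as separate columns, because Python treats strings with different capitalisation as different. If that isn't what you want, one option is to lowercase each word before counting. A sketch with a made-up list of adjectives standing in for the spaCy output:

```python
from collections import Counter

# Made-up sample of adjectives, standing in for the spaCy output
adjectives = ['Good', 'good', 'little', 'Little', 'little', 'dear', 'Dear']

# Counting as-is keeps 'Good' and 'good' apart
case_sensitive = Counter(adjectives)

# Lowercasing first merges them into a single entry
case_insensitive = Counter(word.lower() for word in adjectives)

print(case_sensitive)
print(case_insensitive)
```

In the notebook, the equivalent change would be to collect `token.text.lower()` instead of `token.text`.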

### Saving tabular data

#### CSV files

We can write out tabular data using a standard Python library called ```csv```. Again we open a file, but this time we create a writer object and use it to write rows of data rather than a single string of text.

In [47]:
from csv import DictWriter

# here we write to a 'tab separated file' which can be opened in Excel
# newline='' stops the csv module adding blank lines between rows on Windows
with open('datafile.tsv', 'w', newline='') as f:
    writer = DictWriter(f, fieldnames=col_headings, delimiter='\t')
    writer.writeheader()
    writer.writerows(all_data)
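
What happens to adjectives that a story doesn't have? `DictWriter` fills any missing field with the value of its `restval` option, which is an empty string by default. The sketch below shows this with two small rows (the numbers in the second row are made up) and writes to an in-memory buffer rather than a real file, so you can inspect the result directly:

```python
import csv
import io

# Two small rows; the second row's numbers are made up for illustration
rows = [
    {'story_title': 'At the Bay', 'little': 66, 'blue': 19},
    {'story_title': 'Miss Brill', 'little': 12, 'dear': 5},
]
fieldnames = ['story_title', 'little', 'blue', 'dear']

# io.StringIO behaves like a file but keeps the text in memory
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter='\t', restval='')
writer.writeheader()
writer.writerows(rows)

tsv_text = buffer.getvalue()
print(tsv_text)
```

Each row has the same number of tab-separated fields, with nothing between the tabs where a story has no count for that adjective.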

### Where to from here?

It's only worth investing time in learning to modify and write your own programs if you have a problem they can solve for you. Remember, it could take longer to write a program than to do the work another way.

However, it's important to know that if you want to, you can collect a lot of interesting data from online sources or extract information from within texts.

If you have ideas about data you want to work with, come and see me or Jennifer in the Arts Digital Lab.