# Features beyond words

To this point, we’ve looked at how to capture the structure of text as tabular data in R. Words have been the atomic units of interest, or **features**, and the process of creating a table in which each feature is quantified for each text sample is often called **feature extraction**.

In this section we’ll look at some alternative feature sets beyond words—although most of these are derived from words in some way. Sometimes we collapse words together, for example by **stemming** or **part of speech** tagging, resulting in a smaller feature set. Sometimes, by contrast, we look at word combinations, **n-grams**, and the feature set becomes even bigger. In either case, we will likely go through a process of **feature selection** later, where we mix and match the features most useful to our specific task.

## Responsions between author and audience

One of my favourite aspects of Homer’s poems is the evidence that their oral-formulaic origins have left in the text of an intimate feedback loop between the singer and the audience: the scenes that have been elaborated and adorned with formulas to luxuriate in the listeners’ attention; the concise way information can be condensed when the plot needs to move forward.

But the feedback process is not limited to oral composition. Vergil showed the enormous originality that was possible in "fan fiction" when he turned the raw material of Homer to his own epic purposes. While he was still composing the *Aeneid*, he was already receiving "comments" from his audience, for example Propertius 2.34.66, *nescio quid maius nascitur Iliade*, "Something greater than the Iliad is coming to birth." In particular, Propertius seems to quote Vergil’s own preface to the second half of the *Aeneid*, which Vergil calls a *maius opus*, "a greater work" (Aen. 7.45).

## Fans and fan fiction

In 2014, author Anna Todd composed the novel *After* largely on her phone, using the online self-publishing platform Wattpad. The novel was conceived as a fan fiction, not of a book or film, but of the real-life boy-band One Direction. The story became wildly popular, with readership topping 1,000,000,000 views even as she was writing. According to a contemporary interview with the *New York Times*, she would generate new content daily in response to readers’ comments, shaping the storyline around their fantasies as well as her own. Despite the fact that the novel is [still available for free on Wattpad](https://www.wattpad.com/story/5095707-after), Todd went on to sign a "six-figure" book deal, including film rights.

Not all authors experience this feedback loop so positively, however. George R. R. Martin, author of *A Game of Thrones* and its sequels, reported in an interview that he had been forced to stop reading the theories of his fans online. As he continued to develop the labyrinthine subplots and Tolstoy-sized cast of characters, he had found himself deliberately trying to outsmart the readers. When they correctly guessed the path of the plot, he would change it to foil them. But, he grimly noted, "That way lies madness and disaster." As he fell victim to depression and writer’s block, some fans turned against him. In forums like [Finish the Book, George](https://grrrm.livejournal.com/), some goaded and provoked him. Eventually, the course of the novels was overtaken by the television series.

## Case study: Annacharlier’s *Don’t Go*

<div style="float:right; width:250px; margin:1.5em">
    <figure>
        <img src="https://upload.wikimedia.org/wikipedia/en/1/1c/She-Ra_comparison.png" alt="image">
        <caption>She-Ra in 1985 and 2018. <br/>Source: wikipedia.org</caption>
    </figure>
</div>

For the rest of this session, I want to look at an interesting convolution of fan-author feedback, drawn from the fan fiction web site [Archive of Our Own](https://archiveofourown.org/) (AO3). Annacharlier’s 5000-word *Don’t Go* is the most-read submission within the fandom dedicated to "She-Ra and the Princesses of Power," a Netflix children’s cartoon. "She-Ra" is itself intertextual, being a 2018 reboot of an earlier children’s show from the 1980s, in part deliberately designed to subvert the original program’s messaging about gender norms, sexuality, and body image.

Like many fan fictions, *Don’t Go* represents beloved characters, in this case the protagonist Adora and her principal antagonist Catra, in the "downtime" between episodes, allowing the audience to indulge in a [sense of immersion](https://en.wikipedia.org/wiki/Parasocial_interaction) in the characters’ everyday lives. Also like many fan fictions (dating back at least to the revolutionary Kirk/Spock amateur fictions of the mid-twentieth century), this one develops a same-sex romantic relationship (or **ship**) between principal characters. (Unlike the Kirk/Spock relationship, the "Catradora" ship was established as canonical in the series finale, released a mere 5 days before this story was posted.)
 
What makes the story interesting to me, however, is that not long after it appeared on the site, its pseudonymous author was revealed to be ND Stevenson, the show’s own creator, granting quasi-canonical status to the events described along with enormous encouragement to other fan-fiction authors. In the examples below, we’ll practice downloading the story from AO3 and performing several different feature extraction techniques.

In [1]:
# only need to do this once
install.packages(c('rvest', 'polite'))




These packages are useful tools for tidy-compatible web-scraping. Specifically, the package **rvest** (apparently a pun on "harvest") can be used for downloading pages from the web and parsing their content using **xpath** or **css selectors**, while **polite** is a wrapper for parts of rvest that ensures your script complies with the crawling/scraping policies set out in a site’s `robots.txt` file.

Abiding by the site’s limitations for automated downloads will make sure your script isn’t ruining the experience of human users. It also makes it less likely that you’ll be booted from the site in the middle of a long-running process.

In [2]:
library(tidyverse)
library(tidytext)
library(rvest)
library(polite)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


Attaching package: ‘rvest’


The following object is masked from ‘package:readr’:

    guess_encoding




### Create a session

This opens our session with the website, announcing our intention and checking for rules pertaining to automated scripts.

In [3]:
url <- 'https://www.archiveofourown.org/works/24280306'
session <- bow(url, force=TRUE)

Let’s inspect the session object:

In [4]:
session

<polite session> https://www.archiveofourown.org/works/24280306
    User-agent: polite R package - https://github.com/dmi3kno/polite
    robots.txt: 23 rules are defined for 3 bots
   Crawl delay: 5 sec
  The path is scrapable for this user-agent

This says that we’re allowed to read the site using non-human agents, but they want us to introduce a five-second delay between requests.

The `User-agent` line shows how we’ve identified ourselves to the site. If you want, you can specify something else by passing e.g. `user_agent='cforstall experiment number 1'` as an additional argument to `bow()`. For comparison, here is the equivalent field that my normal web browser reports to sites I’m visiting as a human:

    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0
    
    


We can download a page using `scrape()`:

In [5]:
html_doc <- session %>% scrape()

The result is a complex object representing the structure of the HTML document. The easiest way to work with it is using rvest’s `html_*` functions: especially `html_elements()`, `html_attrs()`, `html_table()`, and `html_text2()`.

These allow you to search for specific parts of the page using xpath or css expressions. For example, all the `<div>` nodes:

In [6]:
html_doc %>% html_elements('div')

{xml_nodeset (24)}
 [1] <div id="outer" class="wrapper">\n      <ul id="skiplinks"><li><a href=" ...
 [2] <div id="header" class="region">\n\n  <h1 class="heading">\n    <a href= ...
 [3] <div id="login" class="dropdown">\n      <p class="user actions" role="m ...
 [4] <div id="small_login" class="simple login">\n\t<form class="new_user" id ...
 [5] <div class="clear"></div>
 [6] <div id="inner" class="wrapper">\n        <!-- BEGIN sidebar -->\n       ...
 [7] <div id="main" class="works-show region" role="main">\n          \n      ...
 [8] <div class="flash"></div>
 [9] <div class="wrapper">\n\n  <dl class="work meta group" role="complementa ...
[10] <div id="workskin">\n  <div class="preface group">\n    <h2 class="title ...
[11] <div class="preface group">\n    <h2 class="title heading">\n      Don't ...
[12] <div class="summary module" role="complementary">\n          <h3 class=" ...
[13] <div id="chapters" role="article">\n        <h3 class="landmark heading" ...
[14] <div class="

Here, we use **css** selectors to look for the first `<div>` with the classes `summary` and `module`, then take the first `<blockquote>` from within that, to isolate the "Summary" section at the top of Annacharlier’s story. Then we extract just the text:

In [7]:
html_doc %>% 
    html_element('div.summary.module') %>%
    html_element('blockquote') %>% 
    html_text2()
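
If you want to experiment with selectors without hitting the live site, one option is to practise on a tiny hand-written page. The sketch below is only an illustration: the HTML is invented (it loosely imitates the class names seen above, not AO3’s real markup), and `toy_doc`, `css_result`, and `xpath_result` are hypothetical names. It runs the same query twice, once as a css selector and once as an xpath expression.

```r
library(rvest)

# An invented miniature page; NOT the real AO3 markup
toy_doc <- minimal_html('
  <div class="summary module">
    <h3>Summary:</h3>
    <blockquote><p>A made-up summary.</p></blockquote>
  </div>
  <div class="notes module">
    <blockquote><p>A made-up note.</p></blockquote>
  </div>
')

# css: a <blockquote> anywhere inside a <div> with classes "summary" and "module"
css_result <- toy_doc %>%
    html_element('div.summary.module blockquote') %>%
    html_text2()

# xpath: roughly the same query, passed via the `xpath` argument instead
xpath_result <- toy_doc %>%
    html_element(xpath = '//div[contains(@class, "summary")]//blockquote') %>%
    html_text2()

print(css_result)
print(xpath_result)
```

Both queries should pick out the first blockquote and skip the one inside the "notes" block. Which syntax you use is largely a matter of taste; css tends to be shorter for class-based lookups like this one.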

### Manual inspection

This part of your research demands a lot of time-consuming manual inspection of the web site you’re interested in. You need to understand how the underlying HTML is structured so that you can pick out the information you want. This almost always entails bespoke solutions and lots of trial and error.

I like to use the "Developer Tools" feature of my web browser to examine the structure of the page. You might prefer to use your favourite text editor. There are also third-party plugins or extensions that you can add to your browser.

Here’s what the site looks like in my browser:

<div style="margin:1em; padding:1em">
<img src="img/annacharlier.png">
</div>
    
Here’s the underlying source HTML for the "Summary" section, also viewed in my browser:

<div style="margin:1em; padding:1em">
<img src="img/annacharlier_source.png">
</div>
    
And here’s what it looks like using the browser’s "inspector" tool. I use Firefox, but the other major browsers have similar functionality. (Sometimes you have to check a box in "Settings" or "Preferences" to enable these tools.)

<div style="margin:1em; padding:1em">
<img src="img/annacharlier_inspector.png">
</div>

Let’s extract the main body of the text from this page so we can process it. It looks to me like the story is enclosed in `<div id="chapters" role="article">`, and then within `<div class="userstuff">`. Let’s try that as a first attempt.

In [8]:
text <- html_doc %>% 
    html_element('div#chapters div.userstuff') %>% 
    html_text2()

In [9]:
str(text)

 chr "They stay that way for a long time, Catra’s face tucked into Adora’s shoulder, Adora gently cradling Catra’s he"| __truncated__


In [10]:
nchar(text)

Let’s tokenize into words with `unnest_tokens()`. But `unnest_tokens()` is expecting a tibble, so first we have to create a tibble with a single row:

In [11]:
# create a one-row tibble for Annacharlier's story
fan_fics <- tibble(
    ao3_id = 24280306,
    text = text
)

# tokenize
tokens <- fan_fics %>%
    unnest_tokens(output=word, input=text)

In [12]:
tokens %>% head(10)

ao3_id,word
<dbl>,<chr>
24280306,they
24280306,stay
24280306,that
24280306,way
24280306,for
24280306,a
24280306,long
24280306,time
24280306,catra’s
24280306,face


## n-grams

Let’s examine some other tokenization options with `unnest_tokens()`. In addition to individual words, we can also break the text into **n-grams**, groups of successive words. Because the words in a text are highly dependent on their neighbours, **bigrams** (pairs), **trigrams** (triples), or even larger groups of words are often more informative than single words (i.e. **unigrams**).

To tokenize into n-grams, add the optional argument `token="ngrams"` and specify a value for `n`:

In [13]:
# tokenize
bigrams <- fan_fics %>%
    unnest_tokens(output=bigram, input=text, token="ngrams", n=2)

head(bigrams)

ao3_id,bigram
<dbl>,<chr>
24280306,they stay
24280306,stay that
24280306,that way
24280306,way for
24280306,for a
24280306,a long


The two-word window slides along the text one word at a time, so each word (except the first and the last) appears twice, once as the right-hand member of a bigram and once as the left-hand member of the next.
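
To see the sliding window at work on something small enough to check by eye, here is a minimal sketch using only the first eight words of the story. The names `toy_text`, `toy_bigrams`, `left_words`, and `right_words` are made up for this illustration.

```r
library(tibble)
library(tidytext)

# the first eight words of the story, as a one-row tibble
toy_text <- tibble(text = "they stay that way for a long time")

toy_bigrams <- unnest_tokens(toy_text, output = bigram, input = text,
                             token = "ngrams", n = 2)
print(toy_bigrams$bigram)

# split each bigram back into its left- and right-hand member
members <- strsplit(toy_bigrams$bigram, " ")
left_words  <- vapply(members, function(m) m[1], character(1))
right_words <- vapply(members, function(m) m[2], character(1))

# an interior word such as "stay" shows up on both sides ...
print("stay" %in% left_words && "stay" %in% right_words)
# ... but the opening word is never on the right, nor the closing word on the left
print("they" %in% right_words)
print("time" %in% left_words)
```

Eight words yield seven bigrams, one fewer than the number of words, because the window cannot start on the final word.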

Which bigrams are most common?

In [14]:
bigrams %>%
    count(bigram, sort=TRUE) %>%
    head(10)

bigram,n
<chr>,<int>
in the,18
she doesn’t,18
of the,16
as she,15
her eyes,15
in her,15
her head,14
she can,14
and she,13
on the,13


## Skip-grams

Sometimes two or more words form a significant collocation but don’t appear exactly side-by-side. For this, we can use `"skip_ngrams"` as the value for `token`. In addition to passing `n`, we can also specify `k`, the maximum number of intervening words.

For example, this should find all pairs of words that co-occur with no more than four words between them. I’m also using `n_min` to eliminate unigrams from the output, since by default the skip-gram tokenizer returns shorter n-grams as well.

In [15]:
fan_fics %>%
    unnest_tokens(output=skipgram, input=text, token="skip_ngrams", n=2, n_min=2, k=4) %>%
    count(skipgram, sort=TRUE) %>%
    head(10)

skipgram,n
<chr>,<int>
she the,45
to her,44
her her,42
her she,42
the the,41
her and,40
the her,39
and her,38
catra her,37
her the,35
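
The effect of `k` is easier to see on a toy sentence than on the full story. The following is a small sketch with invented input; `toy_skip` and `skip_pairs` are hypothetical names. With `k=1`, pairs may be adjacent or have exactly one word between them, but no more.

```r
library(tibble)
library(tidytext)

# a four-word toy text, invented for this illustration
toy_skip <- tibble(text = "one two three four")

skip_pairs <- unnest_tokens(toy_skip, output = skipgram, input = text,
                            token = "skip_ngrams", n = 2, n_min = 2, k = 1)
print(skip_pairs$skipgram)

# "one three" skips a single word, so it is allowed when k = 1
print("one three" %in% skip_pairs$skipgram)
# "one four" would need to skip two words, so it should be absent
print("one four" %in% skip_pairs$skipgram)
```

Widening the window multiplies the number of pairs, and very frequent words get paired with almost everything nearby. That is one plausible reason the top of the table above is dominated by combinations like "she the" and "her her".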


### Separating n-grams for analysis

It may be helpful, after you’ve tallied the n-grams, to split them back into their component words. For this, we can use the function `separate()`, which splits a single character column into several columns based on some separator pattern. By default, `unnest_tokens()` joins the words of an n-gram with a space, so we’ll split on spaces.

In [16]:
bigrams <- bigrams %>%
    count(bigram, sort=TRUE) %>%
    separate(bigram, into=c('left', 'right'), sep=' ', remove=FALSE)

In [17]:
head(bigrams)

bigram,left,right,n
<chr>,<chr>,<chr>,<int>
in the,in,the,18
she doesn’t,she,doesn’t,18
of the,of,the,16
as she,as,she,15
her eyes,her,eyes,15
in her,in,her,15
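
The same approach should carry over to longer n-grams: just give `separate()` one column name per word. Here is a minimal sketch with trigrams, run on the opening words of the story rather than the whole text. The column names `first`, `second`, and `third`, like the object names `toy_tri` and `trigram_parts`, are my own choices for this illustration.

```r
library(tibble)
library(tidyr)
library(tidytext)

# the opening words of the story, as a one-row tibble
toy_tri <- tibble(text = "they stay that way for a long time")

trigram_parts <- toy_tri %>%
    unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>%
    # one new column per word; keep the original trigram column too
    separate(trigram, into = c('first', 'second', 'third'), sep = ' ', remove = FALSE)

print(trigram_parts)
```

Keeping the original column with `remove=FALSE` costs a little memory but makes it easier to read the results when you filter on one of the component columns.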


In [18]:
bigrams %>%
    filter(right=='eyes')

bigram,left,right,n
<chr>,<chr>,<chr>,<int>
her eyes,her,eyes,15
adora’s eyes,adora’s,eyes,2
catra eyes,catra,eyes,1
clone eyes,clone,eyes,1
door eyes,door,eyes,1
entrapta’s eyes,entrapta’s,eyes,1
gold eyes,gold,eyes,1
other’s eyes,other’s,eyes,1
over eyes,over,eyes,1
slowly eyes,slowly,eyes,1


In [19]:
bigrams %>%
    filter(left=="catra’s") %>% 
    head(20)

bigram,left,right,n
<chr>,<chr>,<chr>,<int>
catra’s hand,catra’s,hand,4
catra’s face,catra’s,face,3
catra’s back,catra’s,back,2
catra’s hands,catra’s,hands,2
catra’s head,catra’s,head,2
catra’s room,catra’s,room,2
catra’s wrists,catra’s,wrists,2
catra’s broken,catra’s,broken,1
catra’s chuckle,catra’s,chuckle,1
catra’s claws,catra’s,claws,1
