# Chapter 3: Text Analysis

-- *A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel*

---

In this chapter we will introduce you to the task of text analysis in Python. You will learn how to read an entire corpus into Python, how to clean it, and how to perform certain data analyses on those texts. We will also briefly introduce you to Python's plotting library *matplotlib*, with which you can visualize your data.

Before we delve into the main subject of this chapter, text analysis, we will first write a couple of utility functions that build upon the things you learnt in the previous chapter. Often we don't work with a single text file stored on our computer, but with multiple text files or entire corpora. We would like to have a way to load a corpus into Python.

Remember how to read files? Each time we had to open a file, read the contents and ensure that the file is closed. Since this is a series of steps we will often need to do, we can write a single function that does all that for us. We write a small utility function `read_file(filename)` that reads the specified file and simply returns all contents as a single string.

In [None]:
def read_file(filename):
    """Read the contents of `filename` and return as a string."""
    with open(filename) as infile:
        contents = infile.read()
    return contents

Now, instead of having to remember the steps to read a file, we can just call the function `read_file`:

In [None]:
text = read_file('data/austen-emma-excerpt.txt')
print(text)
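The `with` statement in `read_file` closes the file for us, even if something goes wrong while reading. As a small, self-contained sketch of the same idea, the variant below also passes `encoding='utf-8'` explicitly. The name `read_file_utf8`, the temporary file, and the assumption that your texts are UTF-8 encoded are ours and not part of the course code.

```python
import os
import tempfile

def read_file_utf8(filename):
    """Read the contents of `filename` as UTF-8 and return them as a string."""
    with open(filename, encoding='utf-8') as infile:
        return infile.read()

# Write a small temporary file so the example runs without the course data.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.txt')
    with open(path, 'w', encoding='utf-8') as outfile:
        outfile.write('Emma Woodhouse, handsome, clever, and rich')
    sample = read_file_utf8(path)

print(sample)
```

Without an explicit `encoding`, Python may fall back to a platform-dependent default, so the same file can be read differently on different computers.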

In the directory `data/gutenberg/training` we have a corpus consisting of multiple files with the extension `.txt`. This corpus is a collection of English novels which we downloaded for you from the [Gutenberg](http://www.gutenberg.org) project. We want to iterate over all the filenames. You can do this using the `glob` function from the `glob` module in the Python standard library. We can import this function as follows:

In [None]:
from glob import glob

After that, the `glob` function is available to use. This function takes as argument a pattern for filenames and returns all the matching files and directories:

In [None]:
glob('data/*')

Notice that `glob` returns a list and we can iterate over that list. A `*` in a pattern stands for any directory or filename; the result above lists all the files and directories in the `data` directory. We can make the pattern more specific and select only files ending with `.txt`. The following code reads all text files in the directory `data/gutenberg/training` and outputs the length (in characters) of each:

In [None]:
for filepath in glob('data/gutenberg/training/*.txt'):
    text = read_file(filepath)
    print(f'{filepath} has {len(text)} characters.')
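If you want to experiment with patterns without touching the course data, the following sketch builds a throwaway directory and compares the patterns `*` and `*.txt`. The file names and the temporary directory are made up for illustration.

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmpdir:
    # Create three small files: two text files and one Markdown file.
    for name in ['a.txt', 'b.txt', 'notes.md']:
        with open(os.path.join(tmpdir, name), 'w') as outfile:
            outfile.write('some text')

    # '*' matches everything in the directory, '*.txt' only the text files.
    all_paths = glob(os.path.join(tmpdir, '*'))
    txt_paths = glob(os.path.join(tmpdir, '*.txt'))

    # Keep only the part after the directory and sort for a stable display.
    all_names = sorted(path[len(tmpdir) + 1:] for path in all_paths)
    txt_names = sorted(path[len(tmpdir) + 1:] for path in txt_paths)

print(all_names)
print(txt_names)
```

The names are sorted here because `glob` makes no promise about the order in which it returns matches.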

---

## General Text Statistics

> When the next night came, Dinarazad said to her sister Shahrazad: ‘In God’s name, sister, if you are not asleep, then tell us one of your stories!’ Shahrazad answered: ‘With great pleasure! I have heard tell, honoured King, that…’

*Alf Laylah Wa Laylah*, *the Stories of One Thousand and One Nights*, is a collection of folk tales gathered over many centuries by various authors, translators, and scholars across West, Central and South Asia and North Africa. It forms a huge narrative wheel with an overarching plot, created by the frame story of Shahrazad.

The stories begin with the tale of king Shahryar and his brother, who, having both been deceived by their respective Sultanas, leave their kingdom, only to return when they have found someone who — in their view — was wronged even more. On their journey the two brothers encounter a huge jinn who carries a glass box containing a beautiful young woman. The two brothers hide as quickly as they can in a tree. The jinn lays his head on the girl’s lap and as soon as he is asleep, the girl demands that the two kings make love to her or else she will wake her ‘husband’. They reluctantly give in and the brothers soon discover that the girl has already betrayed the jinn ninety-eight times before. This exemplar of lust and treachery strengthens the Sultan’s opinion that all women are wicked and not to be trusted.

When king Shahryar returns home, his wrath against women has grown to an unprecedented level. To temper his anger, each night the king sleeps with a virgin only to execute her the next morning. In order to make an end to this cruelty and save womanhood from a "virgin scarcity", Shahrazad offers herself as the king’s next bride. On the first night, Shahrazad begins to tell the king a story, but she does not end it. The king’s curiosity to know how the story ends prevents him from executing Shahrazad. The next night Shahrazad finishes her story, and begins a new one. The king, eager to know the ending of this tale as well, postpones her execution once more. Using this strategy for One Thousand and One Nights in a labyrinth of stories-within-stories-within-stories, Shahrazad attempts to gradually move the king’s cynical stance against women towards a politics of love and justice (see Marina Warner’s *Stranger Magic* (2013) in case you're interested).

The first European version of the Nights was translated into French by Antoine Galland. Many translations (in different languages) followed, such as the (heavily criticized) English translation by Sir Richard Francis Burton entitled *The Book of the Thousand and a Night* (1885). This version is freely available from the Gutenberg project (see [here](http://www.gutenberg.org)), and will be the one we will explore here.

---

#### Quiz!

In the directory `data/arabian_nights` you will find 999 files. This is because in Burton's translation some nights are missing. The name of the file represents the corresponding night of storytelling in *Alf Laylah Wa Laylah*. Go have a look. Use `glob` to get a list of all the file names. Store the result in a variable named `filenames`.

Use the tokenize function and the corpus reading function we have defined above and tokenize and clean each night. Store the result in the variable named `corpus`.

In [None]:
# insert your code here

---

Great job! You now should have a list of 999 file names. It is always important to check whether our code actually produces the desired results. Let's check whether we indeed have 999 texts:

In [None]:
print(len(filenames))

OK, that seems to be correct. It would be convenient for further processing to have the corpus in chronological order. Let's have a look at the first 20 file names:

In [None]:
filenames[:20]

As you can see the files are sorted by their string name and not by their numbering. To be able to sort the files by their numbers we must first remove the extension `.txt` as well as the directory `data/arabian_nights/`.

---

#### Quiz!

**1)** Write a function `remove_ext` that takes a filename as argument and returns that filename without its extension. Since filenames are just strings, we could strip off the extension by looking for the period. However, since manipulating filenames is so common, there are functions for this in the standard library. Using them makes it easier to manipulate filenames correctly, and makes it more obvious to others reading your code what you are doing. See the function `splitext` from the `os.path` module. Look up the documentation [here](http://docs.python.org/3/library/os.path.html#os.path.splitext).

In [None]:
from os.path import splitext

def remove_ext(filename):
    # insert your code here
    
# these tests should return True if your code is correct
print(remove_ext("data/arabian_nights/1.txt") == "data/arabian_nights/1")
print(remove_ext("ridiculous_selfie.jpg") == "ridiculous_selfie")

**2)** Write a function `remove_dir` that takes a filepath as argument and returns it without the directory. Tip: use the function `basename` from the `os.path` module. Look up the documentation [here](http://docs.python.org/3/library/os.path.html#os.path.basename).

In [None]:
from os.path import basename

def remove_dir(filepath):
    # insert your code here
    
# these tests should return True if your code is correct
print(remove_dir("data/arabian_nights/1.txt") == "1.txt")
print(remove_dir("/a/kind/of/funny/filepath/to/file.txt") == "file.txt")

**3)** Combine the two functions `remove_ext` and `remove_dir` into one function `remove_dir_ext`. This function takes a filepath as argument and returns the name of the file without the directory and without the extension.

In [None]:
def remove_dir_ext(filepath):
    # insert your code here
    
# these tests should return True if your code is correct
print(remove_dir_ext("data/arabian_nights/1.txt") == '1')

---

The final step is to convert numbers represented as strings (e.g. "1" and "10") to integers. This can be achieved by using the function `int`:

In [None]:
x_as_string = "1"
x_as_int = int(x_as_string)
print(x_as_int)

Remember that strings are a different type than integers. To see this, have a look at the following:

In [None]:
x = "1"
y = "2"
print(x + y)

12? Yes, 12. This is because, as you might remember from the first chapter, we can use the `+` operator to concatenate two strings. If we apply the same operation to integers, as in:

In [None]:
x = 1
y = 2
print(x + y)

we get the expected result of 3.
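The sketch below illustrates why the directory and the extension had to be removed first: `int` only accepts strings that look like whole numbers. The example strings are ours.

```python
# Converting first and then adding gives integer addition.
total = int('10') + int('1')
print(total)

# A string that still carries an extension cannot be converted.
try:
    int('1.txt')
except ValueError as error:
    conversion_failed = True
    print('Cannot convert:', error)
else:
    conversion_failed = False
```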

---

#### Quiz!

Combine the functions `int` and `remove_dir_ext` into the function `get_night` to obtain the integer corresponding to a night.

In [None]:
def get_night(filepath):
    # insert your code here

# these tests should return True if your code is correct
print(get_night("data/arabian_nights/1.txt") == 1)

---

OK, so now we can convert the filepaths to integers corresponding to the nights of storytelling. But how will we use that to sort the corpus in chronological order? In chapter 1 we briefly discussed how to sort your collection of good reads. In combination with our `get_night` function, we can use `sorted` to obtain a nicely chronologically ordered list of stories. Prepare yourself for some real Python magic, because the following lines of code might be a little dazzling...

We start with the variable `filenames`, which already has the list of file names we want to sort.
Next we call the function `sorted()` on this list and supply our function `get_night` as the keyword argument `key`:

In [None]:
filenames = sorted(filenames, key=get_night)
filenames[:20]

A keyword argument is an argument that you specify by name. Keyword arguments must come after normal arguments.

As you can see, we now have a perfectly chronologically ordered list of filenames. But how, **HOW!** did that work? As you might have guessed, the keyword argument `key=get_night` of `sorted` has something to do with all this magic. Without this argument, Python would just sort the filenames alphabetically:

In [None]:
filenames = glob('data/arabian_nights/*.txt')
filenames = sorted(filenames)
print(filenames[:20])

However, if we supply a function to `key`, Python will apply that function to each item and use the result when comparing it to other items during sorting. In our case this means Python orders the filepaths by the integers that `get_night` returns; the filepaths themselves are not changed.
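To see the effect of `key` in isolation, here is a minimal sketch on a hypothetical list of night numbers stored as strings (the list `nights` is made up for illustration):

```python
nights = ['10', '2', '1', '33']

# Without key: strings are compared character by character.
as_strings = sorted(nights)
print(as_strings)

# With key=int: each string is converted to an integer for the comparison only.
as_numbers = sorted(nights, key=int)
print(as_numbers)
```

Both results are still lists of strings; `int` is only used to decide the order.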

---

## Sentence splitting and Spacy

In the previous chapter we wrote a simple function to clean a text so that we could split it into a list of words ("tokenize"). For example:

```python
cleaned = clean_text('This is a sentence. Should we seperate it from this one?')
tokens = cleaned.split()
print(tokens)
```
gave:
```
['this', 'is', 'a', 'sentence', 'should', 'we', 'seperate', 'it', 'from', 'this', 'one']
```
To recap, this is a list of strings. Can you extract the first token?

In [None]:
tokens = ['this', 'is', 'a', 'sentence', 'should', 'we', 'seperate', 'it', 'from', 'this', 'one']
# replace ... with your code
print(... == 'this')

How about the first character of the second token?

In [None]:
print(... == 'i')

As you can see, using this function we lose information about where sentences end and start in the text. What if we want to know about both tokens and sentence boundaries? The sentence is an important unit of analysis in language. For example, longer sentences are typically considered more difficult.

We can preserve information on sentences by adding another level of lists. We can represent a text as a list of sentences, where each sentence is a list of strings (tokens).
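As a small hand-made illustration of this representation (the two sentences are typed in by us, not produced by a tokenizer):

```python
# A text is a list of sentences; each sentence is a list of tokens.
text = [
    ['this', 'is', 'a', 'sentence'],
    ['should', 'we', 'separate', 'it', 'from', 'this', 'one'],
]

print(len(text))       # number of sentences
print(text[1])         # the second sentence
print(text[1][0])      # the first token of the second sentence

# Sentence lengths in tokens, one number per sentence.
sentence_lengths = [len(sentence) for sentence in text]
print(sentence_lengths)
```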

Rather than taking our simple tokenizer and adding a rudimentary sentence splitter to it, we will now turn to a library that does tokenization and sentence splitting, and handles some of the harder edge cases. A simple strategy would split sentences whenever sentence-ending punctuation is encountered (`?!.`), but consider for example:

- The use of `.` as end-of-sentence marker vs. in abbreviations and initials
- Question marks inside direct speech: `"How are you?", she asked.`
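To make the first problem concrete, here is a deliberately naive splitter, sketched with a regular expression from the standard library. The function name `naive_split_sentences` and the example sentence are ours; this is not how Spacy works, only an illustration of what goes wrong with a simple rule.

```python
import re

def naive_split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (a naive rule)."""
    return [part for part in re.split(r'(?<=[.!?])\s+', text) if part]

example = 'Dr. Phil has a Ph.D. in psychology. He is not an M.D.'
naive_sentences = naive_split_sentences(example)
for sentence in naive_sentences:
    print(repr(sentence))
```

The example contains two sentences, but the naive rule produces four pieces because it also splits after `Dr.` and `Ph.D.`.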

Similarly, consider difficult tokenization cases:

- John's bike
- New York-based
- nineteenth- and twentieth-century writers
- (SAM)-missile launchers
- i.e.  Ph.D. &c.
- ;-) :))) #yolo#lit#hashtag
- https://en.wikipedia.org/wiki/Bicycle
- 8-Cyclopentyl-1,3-dimethylxanthine

The [Spacy](https://spacy.io/) library includes a good tokenizer and sentence splitter, alongside components for other Natural Language Processing (NLP) tasks. Install Spacy using the Anaconda Navigator (follow [these instructions](https://docs.anaconda.com/anaconda/navigator/getting-started/#navigator-managing-packages)).

Spacy works rather differently from the code we have written so far. A proper introduction to Spacy is beyond the scope of this notebook, but the documentation of Spacy is excellent and they provide an [online course](https://course.spacy.io/) which you can follow.

For now, we will use the following code to do lower casing, punctuation elimination, tokenization, and sentence splitting with Spacy:

In [None]:
from spacy.lang.en import English
nlp = English()
# Add sentence splitting (spaCy v3 syntax; v2 used nlp.add_pipe(nlp.create_pipe("sentencizer")))
nlp.add_pipe("sentencizer")

def tokenize_sentence(sent):
    """Given a sentence, return lowercased tokens without punctuation and whitespace."""
    return [token.lower_ for token in sent if not token.is_punct and not token.is_space]

def preprocess(text):
    """Given a document as string, return it as a list of sentences,
    where each sentence is a list of cleaned words."""
    doc = nlp(text)
    return [tokenize_sentence(sent) for sent in doc.sents]

- 1: We import the library and specify the language we will use (Spacy supports more languages).
- 2: We create an object to do NLP tasks for English text
- 3-4: By default, sentences are not split, so we add a sentence splitter to the pipeline. Spacy supports several levels of analysis; for now we are only concerned with tokenization and sentence splitting.
- 6-8: The `tokenize_sentence` function handles a single sentence: the lower-case version of each token is used, and punctuation and whitespace are filtered out. Since tokens are special objects for Spacy, they provide this information as pre-computed attributes (i.e., we don't have to call a method to get this information).
- 10-14: The `preprocess` function takes a text as string and applies the NLP pipeline to it. It returns the result as a list of sentences, which in turn are lists of tokens.

Note: we could also return the `doc` result directly, but to use this object, we have to know how Spacy works. By converting everything to lists and strings we can work with data structures that we already know in the rest of this notebook.

In [None]:
# Let's test the code:
preprocess('This is a sentence. Should we seperate it from this one?')

In [None]:
# What about tricky cases? (This is a single string spread over several lines)
text = """She asked: "Is Dr. Phil a doctor?". "Dr. Phil has a
Ph.D. but is not an M.D." I replied. "He even lost
his license to practice therapy," I added."""
preprocess(text)

Note that for readability, the notebook displays long lists with one element per line. Remember that whitespace within lists is not significant.

We can now go back to our sorted list of filenames and read our corpus. We read each file, apply the preprocessing function to it, and collect the result in one big `corpus` list (this may take a minute!):

In [None]:
corpus = []
filenames = glob('data/arabian_nights/*.txt')
filenames = sorted(filenames, key=get_night)
for filename in filenames:
    text = read_file(filename)
    corpus.append(preprocess(text))

The `preprocess` function already returned two levels of lists (and strings), and now we added yet another level. Do you understand the meaning of each level? Try the following:

In [None]:
corpus[0]

In [None]:
corpus[0][0]

In [None]:
corpus[0][0][0]

In [None]:
corpus[0][0][0][0]

Describe each level in one word:

1. ...
2. ...
3. ...
4. ...

---

### Exploratory data analysis

As a first exploratory data analysis, we are going to compute for each night how many sentences it contains and how many words. It is quite easy to count the number of sentences per night, since each night is represented by a list of sentences.

In [None]:
sentences_per_night = []
for night in corpus:
    sentences_per_night.append(len(night))
print(sentences_per_night[:10])

Using the function `max` we can find out what the highest number of sentences is:

In [None]:
max(sentences_per_night)

Similarly, if we would like to know what the lowest number of sentences is, we use the function `min`:

In [None]:
min(sentences_per_night)

---

#### Quiz!

The function `sum` takes a list of numbers as input and returns the sum:

In [None]:
print(sum([1, 3, 3, 4]))

Use this function to compute [the average](https://en.wikipedia.org/wiki/Average#Arithmetic_mean) number of sentences per night.

In [None]:
# insert your code here

---

Given our data structure of a list of sentences which are themselves lists of words, it is a little trickier to count for each night how many words it contains. One possible way is the following:

In [None]:
words_per_night = []
for night in corpus:
    n_words = 0
    for sentence in night:
        n_words += len(sentence)
    words_per_night.append(n_words)

Make sure you really understand these lines of code as you will need them in the next quiz. 
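Once the nested loop is clear, the same count can also be written more compactly with `sum` and a generator expression. The sketch below uses a tiny hand-made corpus, `toy_corpus`, so that the result can be checked by eye; it is an alternative formulation, not a replacement for the loop above.

```python
# Two "nights": the first has two sentences, the second has one.
toy_corpus = [
    [['once', 'upon', 'a', 'time'], ['the', 'end']],
    [['another', 'night']],
]

# For each night, add up the lengths of its sentences.
toy_words_per_night = [
    sum(len(sentence) for sentence in night)
    for night in toy_corpus
]
print(toy_words_per_night)
```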

The suspense created by Shahrazad’s story-telling skills is intriguing, especially the “cliff-hanger” ending each night which she uses to avert her own execution (and possibly that of womanhood). Every night she tells the Sultan a story only to stop at dawn and she picks up the thread the next night. But does it really take the whole night to tell a particular story?

I am not aware of any exact numbers about how many words people speak per minute. Averages seem to fluctuate between 100 and 200 words per minute. Narrators are advised to use approximately 150 words per minute in audiobooks. I suspect that this number is a little lower for live storytelling and assume it lies around 130 words per minute (including pauses). Using this information, we can compute the time it takes to tell a particular story as follows:

$$\textrm{story time}(\textrm{text}) = \frac{\textrm{number of words in text}}{\textrm{number of words per minute}}$$
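As a quick worked example, assume a hypothetical night of 1,950 words (a number chosen only for illustration) told at the assumed speed of 130 words per minute:

$$\textrm{story time} = \frac{1950\ \textrm{words}}{130\ \textrm{words per minute}} = 15\ \textrm{minutes}$$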

---

#### Quiz!

**1)** Write a function called `story_time` that takes as input a text. Given a speed of 130 words per minute, compute how long it takes to tell that text.

In [None]:
def story_time(text):
    # insert your code here

# these tests should return True if your code is correct
print(story_time([["story", "story"]]) * 130 == 2.0)

**2)** Compute the story_time for each night in our corpus. Assign the result to the variable `story_time_per_night`.

In [None]:
story_time_per_night = []
# insert your code here
print(story_time_per_night[:10])

**3)** Compute the average, minimum and maximum storytelling time.

In [None]:
# insert your code here

---

### Visualizing general statistics

Now that we have computed a range of general statistics for our corpus, it would be nice to visualize them. Python's plotting library *matplotlib* (see [here](http://matplotlib.org)) allows us to produce all kinds of graphs. In addition to importing the library, we need to issue a special command to ensure that our plots will be displayed in the notebook:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

Now we can, for example, plot for each story how many sentences it contains:

In [None]:
plt.plot(sentences_per_night);  # ';' at the end suppresses the return value, we only want to see the plot

---

#### Quiz!

**1)** Can you do the same for `words_per_night`?

In [None]:
# insert your code here 

**2)** And can you do the same for `story_time_per_night`?

In [None]:
# insert your code here

**3)** In this final exercise we will put together everything we have learnt so far. We want you to write a function `positions_of` that returns for a given word all sentence positions in the *Arabian Nights* where that word occurs. We are not interested in the positions relative to a particular night, but only to the corpus as a whole. Use that function to find all occurrences of the name *Shahrazad* and store the corresponding indices in the variable `positions_of_shahrazad`. Do the same thing for the name *Ali*. Store the result in `positions_of_ali`. Finally, find all occurrences of *Egypt* and store the indices in `positions_of_egypt`. Hint: to compute the position of a word in the whole corpus, you will need to keep track of a running total of words as you go through each night and sentence.

In [None]:
def positions_of(word):
    # insert your code here

positions_of_shahrazad = positions_of('shahrazad')
positions_of_ali = positions_of('ali')
positions_of_egypt = positions_of('egypt')

If everything went well, the following lines of code should produce a nice [dispersion plot](https://medium.com/the-political-ear/tutorial-plotting-lexical-dispersion-conspiracy-lies-from-the-left-of-center-c0b39de442d5) of all sentence occurrences of Shahrazad, Ali and Egypt in the corpus.

In [None]:
plt.figure(figsize=(20, 8))
names = ['Shahrazad', 'Ali', 'Egypt']
plt.plot(positions_of_shahrazad, [1] * len(positions_of_shahrazad), '|', markersize=100)
plt.plot(positions_of_ali, [2] * len(positions_of_ali), '|', markersize=100)
plt.plot(positions_of_egypt, [0] * len(positions_of_egypt), '|', markersize=100)
plt.yticks(range(len(names)), names)
plt.ylim(-1, 3);

---

> Then Shahrazad reached the morning, and fell silent in the telling of her tale…

---

<p><small><a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">Python Programming for the Humanities</span> by <a xmlns:cc="http://creativecommons.org/ns#" href="http://fbkarsdorp.github.io/python-course" property="cc:attributionName" rel="cc:attributionURL">http://fbkarsdorp.github.io/python-course</a> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>. Based on a work at <a xmlns:dct="http://purl.org/dc/terms/" href="https://github.com/fbkarsdorp/python-course" rel="dct:source">https://github.com/fbkarsdorp/python-course</a>.</small></p>