# A few text processing tasks with `nltk`
We're going to cover a few common text processing tasks that we can do with `nltk`. You may want to use some or all of these for your research. No need to reinvent the wheel!

First, let's get nltk up and running:

In [34]:
import nltk

After that import, run the following command:

In [35]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

That will bring up a window on your machine. Select the row called `book`, which should include all of the elements we'll use in class today. Download that, and we'll follow along.

# Splitting by sentence
We know how to split by words. But you may be interested in other units like the sentence. `nltk` can help us get all of the sentences out of a book:

In [37]:
import os
hp_dir = '/Users/e/code/literarytextmining/corpora/harry_potter/texts'
hp_relative = os.listdir(hp_dir)
print(hp_relative)
hp_absolute = []
for x in hp_relative:
    abs_path = os.path.join(hp_dir, x)
    hp_absolute.append(abs_path)

['5 Order of the Phoenix.txt', '4 Goblet of Fire.txt', '6 Half-Blood Prince.txt', '1 Sorcerers Stone.txt', '3 Prisoner of Azkaban.txt', '7 Deathly Hallows.txt', '2 Chamber of Secrets.txt']


In [38]:
hp_absolute

['/Users/e/code/literarytextmining/corpora/harry_potter/texts/5 Order of the Phoenix.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/4 Goblet of Fire.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/6 Half-Blood Prince.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/1 Sorcerers Stone.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/3 Prisoner of Azkaban.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/7 Deathly Hallows.txt',
 '/Users/e/code/literarytextmining/corpora/harry_potter/texts/2 Chamber of Secrets.txt']
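`os.listdir` returns names in arbitrary order, which is why the books above are not in series order and why the index of a given book can differ from one machine to another. A minimal sketch, assuming a throwaway directory with made-up file names (not the real corpus), that sorts the listing before building absolute paths:

```python
import os
import tempfile

# Build a throwaway directory with hypothetical file names standing in for the corpus.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ['2 Second Book.txt', '1 First Book.txt', '3 Third Book.txt']:
        with open(os.path.join(demo_dir, name), 'w') as f:
            f.write('placeholder text')

    # sorted() makes the order predictable; the comprehension replaces the append loop.
    demo_relative = sorted(os.listdir(demo_dir))
    demo_absolute = [os.path.join(demo_dir, name) for name in demo_relative]

print(demo_relative)
```

With a sorted listing, index 0 is always the first book, so picking a text by position is less fragile.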

In [39]:
with open(hp_absolute[2]) as f:  # index 2 is '6 Half-Blood Prince.txt' in the listing above
    prince = f.read()

In [40]:
prince[:100]

'CHAPTER ONE\n\n\n\n\n\nTHE OTHER MINISTER\n\n\n\n\nIt was nearing midnight and the Prime Minister was sitting a'

In [41]:
# we've got a lot of whitespace. let's replace all that with single spaces
import re
prince = re.sub(r'\s+', ' ', prince)  # raw string so \s is read as a regex escape

In [42]:
prince[:100]

'CHAPTER ONE THE OTHER MINISTER It was nearing midnight and the Prime Minister was sitting alone in h'

Now, we can test out `nltk`'s sentence tokenizer:

In [43]:
from nltk.tokenize import sent_tokenize
sent_tokenize(prince)[:5]

['CHAPTER ONE THE OTHER MINISTER It was nearing midnight and the Prime Minister was sitting alone in his office, reading a long memo that was slipping through his brain without leaving the slightest trace of meaning behind.',
 'He was waiting for a call from the President of a far distant country, and between wondering when the wretched man would telephone, and trying to suppress unpleasant memories of what had been a very long, tiring, and difficult week, there was not much space in his head for anything else.',
 'The more he attempted to focus on the print on the page before him, the more clearly the Prime Minister could see the gloating face of one of his political opponents.',
 'This particular opponent had appeared on the news that very day, not only to enumerate all the terrible things that had happened in the last week (as though anyone needed reminding) but also to explain why each and every one of them was the government’s fault.',
 'The Prime Minister’s pulse quickened at the

# Looking for specific sentence structures

`sent_tokenize` returns a list of sentences, and we can do a lot with that list. For instance, perhaps we only want to see the sentences that contain both `not only` and `but also`, as in the structure "not only this but also that."

In [44]:
prince_sents = sent_tokenize(prince)

for sent in prince_sents:
    if 'not only' in sent.lower() and 'but also' in sent.lower():
        print(sent)
        print('-'*80)

This particular opponent had appeared on the news that very day, not only to enumerate all the terrible things that had happened in the last week (as though anyone needed reminding) but also to explain why each and every one of them was the government’s fault.
--------------------------------------------------------------------------------
The more Harry pored over the book, the more he realized how much was in there, not only the handy hints and shortcuts on potions that were earning him such a glowing reputation with Slughorn, but also the imaginative little jinxes and hexes scribbled in the margins, which Harry was sure, judging by the crossings-out and revisions, that the Prince had invented himself.
--------------------------------------------------------------------------------
Harry awoke next morning feeling slightly dazed and confused by a series of dreams in which Ron had chased him with a Beater’s bat, but by midday he would have happily exchanged the dream Ron for the real 
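The substring test above accepts the two phrases in either order. If the order matters, a regular expression can require `not only` to come first. A minimal sketch on made-up sentences (the `demo_sents` list and the pattern are illustrative assumptions, not part of the class corpus):

```python
import re

demo_sents = [
    'She was not only tired but also hungry.',
    'He was but also late, not only once.',  # contrived: phrases in the wrong order
    'Nothing to see here.',
]

# \b keeps us from matching inside longer words; .*? allows any text in between.
pattern = re.compile(r'\bnot only\b.*?\bbut also\b', re.IGNORECASE)

ordered = [sent for sent in demo_sents if pattern.search(sent)]
print(ordered)
```

Only the first sentence is kept, because it is the only one where `not only` precedes `but also`.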

## Similes
Another easy type of sentence to look for is the simile. Similes are a subset of metaphor in which the comparison is set off by a phrase such as `like a`, `as if`, or `such as`.

A classic distinction between metaphor and simile exists in the trope about the "taste" of knowledge.

The poem ["At the Fishhouses"](https://www.poetryfoundation.org/poems/52192/at-the-fishhouses) by Elizabeth Bishop contains two excellent examples comparing knowledge to the ocean:
- Simile: "(The sea) is like what we imagine knowledge to be: / dark, salt, clear, moving, utterly free,"
- Metaphor: "I have seen it over and over, the same sea, the same, / slightly, indifferently swinging above the stones, / icily free above the stones, / above the stones and then the world."

In [45]:
simile_sents = []
simile_phrases = ['as if', 'such as', 'like a']

for sent in prince_sents:
    for phrase in simile_phrases:
        if phrase in sent.lower():
            simile_sents.append(sent)

In [46]:
len(simile_sents)

91

In [48]:
simile_sents[:5]

['He was, after all, the Prime Minister and did not appreciate being made to feel like an ignorant schoolboy.',
 'The papers had a field day with it, ‘breakdown of law and order in the Prime Minister’s backyard —’” “And as if all that wasn’t enough,” said Fudge, barely listening to the Prime Minister, “we’ve got dementors swarming all over the place, attacking people left, right, and center.',
 'The Prime Minister’s first, foolish thought was that Rufus Scrimgeour looked rather like an old lion.',
 'You wouldn’t —” “There is nothing I wouldn’t do anymore!” Narcissa breathed, a note of hysteria in her voice, and as she brought down the wand like a knife, there was another flash of light.',
 'At last, Narcissa hurried up a street named Spinner’s End, over which the towering mill chimney seemed to hover like a giant admonitory finger.']
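One caveat with the loop above: it appends a sentence once for every phrase it contains, so a sentence with both `as if` and `like a` is counted twice. A minimal sketch, using made-up sentences, of a variant that uses `any` to keep each sentence at most once:

```python
simile_phrases = ['as if', 'such as', 'like a']

demo_sents = [
    'He ran like a hare, as if chased.',   # two phrases, one sentence
    'She looked like an old lion.',        # 'like a' also matches 'like an'
    'It rained all day.',
]

# any() stops at the first matching phrase, so each sentence is appended at most once.
demo_similes = [
    sent for sent in demo_sents
    if any(phrase in sent.lower() for phrase in simile_phrases)
]
print(len(demo_similes))
```

Whether this changes the count for the novel depends on how many of its sentences contain more than one of the phrases.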

## Tripping up on ellipses
Let's take another look at our sentences:

In [16]:
prince_sents[30:50]

['He turned very slowly to face the empty room.',
 '“Hello?” he said, trying to sound braver than he felt.',
 'For a brief moment he allowed himself the impossible hope that nobody would answer him.',
 'However, a voice responded at once, a crisp, decisive voice that sounded as though it were reading a prepared statement.',
 'It was coming — as the Prime Minister had known at the first cough — from the froglike little man wearing a long silver wig who was depicted in a small, dirty oil painting in the far corner of the room.',
 '“To the Prime Minister of Muggles.',
 'Urgent we meet.',
 'Kindly respond immediately.',
 'Sincerely, Fudge.” The man in the painting looked inquiringly at the Prime Minister.',
 '“Er,” said the Prime Minister, “listen.',
 '.',
 '.',
 '.',
 'It’s not a very good time for me.',
 '.',
 '.',
 '.',
 'I’m waiting for a telephone call, you see .',
 '.',
 '.']

`nltk` is good, but it isn't perfect. Those single `.` suggest that ellipses are tricking it. We can try to fix that with a simple `replace`:

In [50]:
prince = prince.replace('. . .','...')

Let's try it again with the cleaned text:

In [51]:
sent_tokenize(prince)[30:50] # better, though a four-dot ellipsis still leaves a stray ' .'

['Urgent we meet.',
 'Kindly respond immediately.',
 'Sincerely, Fudge.” The man in the painting looked inquiringly at the Prime Minister.',
 '“Er,” said the Prime Minister, “listen... .',
 'It’s not a very good time for me... .',
 'I’m waiting for a telephone call, you see ... from the President of —” “That can be rearranged,” said the portrait at once.',
 'The Prime Minister’s heart sank.',
 'He had been afraid of that.',
 '“But I really was rather hoping to speak —” “We shall arrange for the President to forget to call.',
 'He will telephone tomorrow night instead,” said the little man.',
 '“Kindly respond immediately to Mr. Fudge.” “I ... oh ... very well,” said the Prime Minister weakly.',
 '“Yes, I’ll see Fudge.” He hurried back to his desk, straightening his tie as he went.',
 'He had barely resumed his seat, and arranged his face into what he hoped was a relaxed and unfazed expression, when bright green flames burst into life in the empty grate beneath his marble mantelpiece.',

In [20]:
prince_sents = sent_tokenize(prince)
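The `replace` above handles three-dot ellipses, but a four-dot ellipsis (`. . . .`) still leaves a stray ` .` behind, as in `listen... .` in the output. A minimal sketch, assuming the ellipses are always dots separated by single spaces, that collapses a run of any length with a regular expression (the `demo_text` string is made up):

```python
import re

demo_text = 'listen. . . . It is not a good time. . . I see.'

# Match a dot followed by one or more " ." pairs, then strip the spaces from the match.
collapsed = re.sub(r'\.(?: \.)+', lambda m: m.group(0).replace(' ', ''), demo_text)
print(collapsed)
```

It is still worth checking the tokenizer's output on your own text afterwards, since this only tidies the dots and does not decide where sentences end.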

# How many sentences are in my text?
Of course this lets us easily check the number of sentences in the text:

In [52]:
len(sent_tokenize(prince))

8762

# How many words in each sentence in my text?

In [23]:
for sent in prince_sents[:10]: # just demoing the first ten
    word_count = len(sent.split(' '))
    print(word_count, sent)
    print('-'*80)

37 CHAPTER ONE THE OTHER MINISTER It was nearing midnight and the Prime Minister was sitting alone in his office, reading a long memo that was slipping through his brain without leaving the slightest trace of meaning behind.
--------------------------------------------------------------------------------
51 He was waiting for a call from the President of a far distant country, and between wondering when the wretched man would telephone, and trying to suppress unpleasant memories of what had been a very long, tiring, and difficult week, there was not much space in his head for anything else.
--------------------------------------------------------------------------------
31 The more he attempted to focus on the print on the page before him, the more clearly the Prime Minister could see the gloating face of one of his political opponents.
--------------------------------------------------------------------------------
46 This particular opponent had appeared on the news that very day, no
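Splitting on spaces counts free-standing punctuation such as `—` as a word. A minimal sketch, on a made-up sentence, that counts only tokens containing at least one letter or digit:

```python
import re

demo_sent = 'It was coming — as he had known — from the painting.'

naive_count = len(demo_sent.split(' '))

# Keep a token only if it contains at least one word character (letter, digit, underscore).
word_count = len([tok for tok in demo_sent.split() if re.search(r'\w', tok)])
print(naive_count, word_count)
```

The two dashes account for the difference between the counts. Which definition of "word" is right depends on your research question.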

# How many words on average in a Rowling sentence?
1. Append all sentence lengths to a list
2. Sum the list
3. Divide that sum by the length of the list

In [53]:
sent_lens = []

for sent in prince_sents:
    word_count = len(sent.split(' '))
    sent_lens.append(word_count)
    
avg_sent = sum(sent_lens) / len(sent_lens)

print(avg_sent)

14.188467060137183


An average of about 14 words is at the top of the usual range. Generally in the 19th and 20th centuries, the average sentence in English-language fiction is between 12 and 14 words.

# What about the median Rowling sentence?
1. Append all sentence lengths to a list
2. Sort the list
3. Identify the middle-most value

In [54]:
sent_lens = []

for sent in prince_sents:
    word_count = len(sent.split(' '))
    sent_lens.append(word_count)
    
sent_lens = sorted(sent_lens)
median = sent_lens[len(sent_lens) // 2]  # integer division gives the middle index (round() can land one off for odd lengths)
median

11
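Python's standard library can do both the mean and the median calculation. A minimal sketch using the `statistics` module on a made-up list of lengths (the numbers are illustrative, not taken from the novel). Note that for an even number of values `statistics.median` averages the two middle values, whereas indexing into the sorted list picks one of them:

```python
import statistics

demo_lens = [37, 51, 31, 46, 19, 8]

demo_mean = statistics.mean(demo_lens)
demo_median = statistics.median(demo_lens)              # averages the two middle values, 31 and 37
upper_middle = sorted(demo_lens)[len(demo_lens) // 2]   # indexing picks 37
print(demo_mean, demo_median, upper_middle)
```

On a list as long as a novel's sentence lengths the two approaches will usually agree or differ by at most a word.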

# What about the modal length of a Rowling sentence?
That is, which sentence length does Rowling use most often?

In [56]:
sent_lens = []

for sent in prince_sents:
    word_count = len(sent.split(' '))
    sent_lens.append(word_count)
    

In [57]:
sent_lens[:5]

[37, 51, 31, 46, 19]

In [58]:
# count each length by hand with a dictionary
d = {}
for x in sent_lens:
    if x not in d:
        d[x] = 1
    else:
        d[x] += 1

In [60]:
from collections import Counter # this is a fast way to count all of the values in a list like this
len_freqs = Counter(sent_lens)
# yes, you can also use Counter to calculate the number of tokens for each type in your list of words

In [149]:
len_freqs[5], len_freqs[10], len_freqs[15]

(270, 326, 293)

In [67]:
most = max(len_freqs.values())  # compute the highest count once, outside the loop
for key, value in len_freqs.items():
    if value == most:
        print(key)

13


Rowling most commonly writes 13-word sentences.
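`Counter` can also report the mode directly. A minimal sketch on made-up lengths: `most_common(1)` returns the top (value, count) pair, and `statistics.multimode` (Python 3.8+) returns every value tied for first place, which matters when two lengths are equally frequent:

```python
from collections import Counter
import statistics

demo_lens = [13, 11, 13, 9, 11, 20]

demo_freqs = Counter(demo_lens)
top_len, top_count = demo_freqs.most_common(1)[0]  # one winner, even if there is a tie
all_modes = statistics.multimode(demo_lens)         # every value tied for most frequent
print(top_len, top_count, all_modes)
```

Here 13 and 11 both occur twice, so `most_common(1)` alone would hide the tie.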

# Let's see examples of each of these

In [63]:
medians = []
means = []
modes = []

for sent in prince_sents:
    word_count = len(sent.split(' '))
    if word_count == 17:  # hardcoded; note that the median computed above was 11
        medians.append(sent)
    elif word_count == 14:
        means.append(sent)
    elif word_count == 13:
        modes.append(sent)

In [64]:
len(medians)

280

In [65]:
len(means)

316

In [66]:
len(modes)

381

In [67]:
medians[:5]

['“A grim mood has gripped the country,” the opponent had concluded, barely concealing his own broad grin.',
 'The Prime Minister had seen that kind of look in politicians before, and it never boded well.',
 'Fudge had then patted the shoulder of the still-dumbstruck Prime Minister in a fatherly sort of way.',
 '“No need to worry, no need to worry!” shouted Fudge, already with one foot in the flames.',
 'While the Prime Minister surreptitiously touched the wood of his desk, Fudge continued, “But Black’s by-the-by now.']

In [68]:
means[:5]

['The Prime Minister felt it himself; people really did seem more miserable than usual.',
 'Even the weather was dismal; all this chilly mist in the middle of July.',
 'He froze, nose to nose with his own scared-looking reflection in the dark glass.',
 'from the President of —” “That can be rearranged,” said the portrait at once.',
 '“You’re — you’re not a hoax, then?” It had been his last, desperate hope.']

In [69]:
modes[:5]

['How on earth was his government supposed to have stopped that bridge collapsing?',
 'Sincerely, Fudge.” The man in the painting looked inquiringly at the Prime Minister.',
 'He was thinner, balder, and grayer, and his face had a crumpled look.',
 '“Not to worry,” he had said, “it’s odds-on you’ll never see me again.',
 'And I must say, you’re taking it a lot better than your predecessor.']
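The lengths in the loop above are typed in by hand. A minimal sketch, on made-up sentences, that groups every sentence by its length so that examples for any computed value can be looked up afterwards (the names `by_length` and `demo_sents` are illustrative):

```python
from collections import defaultdict

demo_sents = [
    'He turned very slowly.',
    'Urgent we meet.',
    'Kindly respond immediately.',
    'The Prime Minister sighed.',
]

# Map each word count to the list of sentences with that many words.
by_length = defaultdict(list)
for sent in demo_sents:
    by_length[len(sent.split(' '))].append(sent)

print(by_length[3])
```

With the real text, something like `by_length[median][:5]` would then show examples for whatever median the earlier cell produced, without retyping the number.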