DHOxSS Computational thinking Example 1 - Concordance

The Natural Language Toolkit (NLTK) provides a library of text processing tools and access to some corpora - for more info see https://www.nltk.org/

The following lines of Python code set up the library for use and download the data it needs - they do not need to be changed

In [18]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('gutenberg')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/davidderoure/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/davidderoure/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package gutenberg to
[nltk_data]     /Users/davidderoure/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

Here we create a small sample of text to experiment with

In [2]:
txt = """
It is a truth universally acknowledged, that a single man in possession of a good fortune must
be in want of a wife. However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding
families, that he is considered as the rightful property of some one or other of their daughters."""


Now we ask Python to show us this text (any line breaks appear as \n)

In [3]:
txt

'\nIt is a truth universally acknowledged, that a single man in possession of a good fortune must\nbe in want of a wife. However little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds of the surrounding\nfamilies, that he is considered as the rightful property of some one or other of their daughters.'

We use the NLTK sentence tokenizer to identify the individual sentences - it gives us a list of sentences, using the list syntax [ item, item, item, ... ]

In [4]:
sents = sent_tokenize(txt)
sents

['\nIt is a truth universally acknowledged, that a single man in possession of a good fortune must\nbe in want of a wife.',
 'However little known the feelings or views of such a man may be on his\nfirst entering a neighbourhood, this truth is so well fixed in the minds of the surrounding\nfamilies, that he is considered as the rightful property of some one or other of their daughters.']

Check the length of the list of sentences

In [5]:
len(sents)

2
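To see why a dedicated sentence tokenizer is useful, here is a minimal sketch of a naive splitter written with Python's standard `re` module. It is not how NLTK works internally; the function name `naive_sentences` and the two sample strings (a shortened version of the text above and an invented sentence) are assumptions made for illustration.

```python
import re

def naive_sentences(text):
    """Split text wherever '.', '!' or '?' is followed by whitespace."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# A shortened version of the sample text: the naive rule finds both sentences
short_sample = ("It is a truth universally acknowledged. "
                "However little known, this truth is well fixed.")
print(naive_sentences(short_sample))

# An invented sentence with an abbreviation: the naive rule wrongly splits after 'Mr.'
abbrev_sample = "Mr. Weston was a man. He was kind."
print(naive_sentences(abbrev_sample))
```

The second call returns three pieces instead of two, because the rule cannot tell an abbreviation from the end of a sentence. Handling cases like this is the reason for using a trained tokenizer such as `sent_tokenize`.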

Print on the screen each of the sentences in the list 

In [6]:
for s in sents:
    print(s)


It is a truth universally acknowledged, that a single man in possession of a good fortune must
be in want of a wife.
However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding
families, that he is considered as the rightful property of some one or other of their daughters.


Now print the length of each sentence - this is the number of characters, not words

In [7]:
for s in sents:
    print(len(s))

117
260


Now we tokenize into words instead of sentences

In [8]:
words = word_tokenize(txt)
print(words)

['It', 'is', 'a', 'truth', 'universally', 'acknowledged', ',', 'that', 'a', 'single', 'man', 'in', 'possession', 'of', 'a', 'good', 'fortune', 'must', 'be', 'in', 'want', 'of', 'a', 'wife', '.', 'However', 'little', 'known', 'the', 'feelings', 'or', 'views', 'of', 'such', 'a', 'man', 'may', 'be', 'on', 'his', 'first', 'entering', 'a', 'neighbourhood', ',', 'this', 'truth', 'is', 'so', 'well', 'fixed', 'in', 'the', 'minds', 'of', 'the', 'surrounding', 'families', ',', 'that', 'he', 'is', 'considered', 'as', 'the', 'rightful', 'property', 'of', 'some', 'one', 'or', 'other', 'of', 'their', 'daughters', '.']


In [9]:
len(words)

76
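The count of 76 includes punctuation marks, because the word tokenizer treats each one as a separate token. The sketch below compares two rough alternatives from the standard library: splitting on whitespace, and a simple regular expression. Neither is NLTK's algorithm; the regular expression is only an approximation that happens to agree on this sample, and the variable names are chosen for illustration.

```python
import re

txt = """
It is a truth universally acknowledged, that a single man in possession of a good fortune must
be in want of a wife. However little known the feelings or views of such a man may be on his
first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding
families, that he is considered as the rightful property of some one or other of their daughters."""

# Splitting on whitespace leaves punctuation attached to the preceding word
whitespace_tokens = txt.split()
print(len(whitespace_tokens))   # 71
print(whitespace_tokens[5])     # acknowledged,

# A rough pattern: runs of word characters, or any single punctuation mark
regex_tokens = re.findall(r"\w+|[^\w\s]", txt)
print(len(regex_tokens))        # 76
print(regex_tokens[5:7])        # ['acknowledged', ',']
```

The difference of five tokens corresponds to the three commas and two full stops in the sample. On text with contractions, hyphens or abbreviations the simple pattern and `word_tokenize` would be expected to disagree.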

To demonstrate a simple automated task which the computer can perform easily, here we use the NLTK library to identify parts of speech

In [10]:
wordpos = nltk.pos_tag(words)
wordpos

[('It', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('truth', 'NN'),
 ('universally', 'RB'),
 ('acknowledged', 'VBD'),
 (',', ','),
 ('that', 'IN'),
 ('a', 'DT'),
 ('single', 'JJ'),
 ('man', 'NN'),
 ('in', 'IN'),
 ('possession', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('good', 'JJ'),
 ('fortune', 'NN'),
 ('must', 'MD'),
 ('be', 'VB'),
 ('in', 'IN'),
 ('want', 'NN'),
 ('of', 'IN'),
 ('a', 'DT'),
 ('wife', 'NN'),
 ('.', '.'),
 ('However', 'RB'),
 ('little', 'JJ'),
 ('known', 'VBN'),
 ('the', 'DT'),
 ('feelings', 'NNS'),
 ('or', 'CC'),
 ('views', 'NNS'),
 ('of', 'IN'),
 ('such', 'JJ'),
 ('a', 'DT'),
 ('man', 'NN'),
 ('may', 'MD'),
 ('be', 'VB'),
 ('on', 'IN'),
 ('his', 'PRP$'),
 ('first', 'JJ'),
 ('entering', 'VBG'),
 ('a', 'DT'),
 ('neighbourhood', 'NN'),
 (',', ','),
 ('this', 'DT'),
 ('truth', 'NN'),
 ('is', 'VBZ'),
 ('so', 'RB'),
 ('well', 'RB'),
 ('fixed', 'VBN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('minds', 'NNS'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('surrounding', 'VBG'),
 ('families', 'NNS'),
 

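This is an alternative way of printing out all the items in a list - the * unpacks the list, so each item is passed to print separately and they appear on one line separated by spaces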

In [11]:
print(*wordpos)

('It', 'PRP') ('is', 'VBZ') ('a', 'DT') ('truth', 'NN') ('universally', 'RB') ('acknowledged', 'VBD') (',', ',') ('that', 'IN') ('a', 'DT') ('single', 'JJ') ('man', 'NN') ('in', 'IN') ('possession', 'NN') ('of', 'IN') ('a', 'DT') ('good', 'JJ') ('fortune', 'NN') ('must', 'MD') ('be', 'VB') ('in', 'IN') ('want', 'NN') ('of', 'IN') ('a', 'DT') ('wife', 'NN') ('.', '.') ('However', 'RB') ('little', 'JJ') ('known', 'VBN') ('the', 'DT') ('feelings', 'NNS') ('or', 'CC') ('views', 'NNS') ('of', 'IN') ('such', 'JJ') ('a', 'DT') ('man', 'NN') ('may', 'MD') ('be', 'VB') ('on', 'IN') ('his', 'PRP$') ('first', 'JJ') ('entering', 'VBG') ('a', 'DT') ('neighbourhood', 'NN') (',', ',') ('this', 'DT') ('truth', 'NN') ('is', 'VBZ') ('so', 'RB') ('well', 'RB') ('fixed', 'VBN') ('in', 'IN') ('the', 'DT') ('minds', 'NNS') ('of', 'IN') ('the', 'DT') ('surrounding', 'VBG') ('families', 'NNS') (',', ',') ('that', 'IN') ('he', 'PRP') ('is', 'VBZ') ('considered', 'VBN') ('as', 'IN') ('the', 'DT') ('rightful', '

We print out all the nouns, indicated by:
NN      Noun, singular or mass
NNP     Proper noun, singular
NNS     Noun, plural
NNPS    Proper noun, plural

In [12]:
# Print only the words whose tag is one of the four noun tags
for word, pos in wordpos:
    if pos in ('NN', 'NNP', 'NNS', 'NNPS'):
        print(word)

truth
man
possession
fortune
want
wife
feelings
views
man
neighbourhood
truth
minds
families
property
daughters
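The same filtering can be sketched more compactly. The example below does not call NLTK: it uses a hand-copied list of the first few (word, tag) pairs from the output above, stored under the illustrative name `tagged`. It relies on the fact that the four noun tags listed above are the only tags in this tag set that begin with NN.

```python
from collections import Counter

# The first (word, tag) pairs from the tagger output above, copied by hand
tagged = [('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('truth', 'NN'),
          ('universally', 'RB'), ('acknowledged', 'VBD'), (',', ','),
          ('that', 'IN'), ('a', 'DT'), ('single', 'JJ'), ('man', 'NN'),
          ('in', 'IN'), ('possession', 'NN'), ('of', 'IN'), ('a', 'DT'),
          ('good', 'JJ'), ('fortune', 'NN')]

# Collect the nouns into a list instead of printing them one by one
nouns = [word for word, pos in tagged if pos.startswith('NN')]
print(nouns)   # ['truth', 'man', 'possession', 'fortune']

# Count how often each tag occurs
tag_counts = Counter(pos for _, pos in tagged)
print(tag_counts.most_common(3))
```

Collecting the nouns in a list, rather than printing them, makes it possible to count or sort them afterwards.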


Now we use a larger amount of text. This line of code gives us the names of the available texts

In [13]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

We choose one of them to work with

In [14]:
emma = nltk.text.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))

Now we try generating concordances for some individual words

In [15]:
emma.concordance("man")

Displaying 25 of 235 matches:
ss for her friend . Mr . Weston was a man of unexceptionable character , easy f
of mind or body , he was a much older man in ways than in years ; and though ev
s required support . He was a nervous man , easily depressed ; fond of every bo
ood - humoured , pleasant , excellent man , that he thoroughly deserves a good 
cessary . Mr . Knightley , a sensible man about seven or eight - and - thirty ,
 " A straightforward , open - hearted man like Weston , and a rational , unaffe
" " Mr . Elton is a very pretty young man , to be sure , and a very good young 
 , to be sure , and a very good young man , and I have a great regard for him .
use his own wife . Depend upon it , a man of six or seven - and - twenty can ta
s ' marriage , he was rather a poorer man than at first , and with a child to m
hrough . He had never been an unhappy man ; his own temper had secured him from
nd report of him as a very fine young man had made Highbury feel a sort of prid
d a very f

In [16]:
emma.concordance("woman")

Displaying 25 of 131 matches:
ce had been supplied by an excellent woman as governess , who had fallen little
Weston , and a rational , unaffected woman like Miss Taylor , may be safely lef
ways longed for -- enough to marry a woman as portionless even as Miss Taylor ,
l a well - judging and truly amiable woman could be , and must give him the ple
on of it . The aunt was a capricious woman , and governed her husband entirely 
 . She felt herself a most fortunate woman ; and she had lived long enough to k
 uncommon degree of popularity for a woman neither young , handsome , rich , no
s possible . And yet she was a happy woman , and a woman whom no one named with
nd yet she was a happy woman , and a woman whom no one named without good - wil
. She was a plain , motherly kind of woman , who had worked hard in her youth ,
could meet with a good sort of young woman in the same rank as his own , with a
 he marries a very ignorant , vulgar woman , certainly I had better not visit h
ing young 

In [17]:
emma.concordance("gentleman")

Displaying 25 of 35 matches:
re can be no doubt of your being a gentleman ' s daughter , and you must suppor
Knightley . But he is not the only gentleman you have been lately used to . Wha
tion was most suitable , quite the gentleman himself , and without low connexio
"-- unclosing a pretty sketch of a gentleman in small size , whole - length --"
tion it would not have disgraced a gentleman ; the language , though plain , wa
ied to a respectable , intelligent gentleman - farmer !" " As to the circumstan
ly be a doubt that her father is a gentleman -- and a gentleman of fortune .-- 
her father is a gentleman -- and a gentleman of fortune .-- Her allowance is ve
ement or comfort .-- That she is a gentleman ' s daughter , is indubitable to m
 gentlemen are ; and nothing but a gentleman in education and manner has any ch
 , " Oh ! dear , yes ," before the gentleman joined them . The wants and suffer
. Mr . John Knightley was a tall , gentleman - like , and very clever man ; ris
modern days
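
To show what a concordance does, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name `simple_concordance`, its `width` and `max_lines` parameters, the ten-token context window and the choice to ignore case are all assumptions made for illustration.

```python
def simple_concordance(tokens, target, width=30, max_lines=25):
    """Return one line per match, with the keyword aligned in a fixed column."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Up to ten tokens of context on each side, trimmed to `width` characters
        left = ' '.join(tokens[max(0, i - 10):i])[-width:]
        right = ' '.join(tokens[i + 1:i + 11])[:width]
        # Right-align the left context so every keyword starts in the same column
        lines.append(f"{left:>{width}} {token} {right}")
        if len(lines) == max_lines:
            break
    return lines

# The first sentence of the sample text, already split into tokens
tokens = ("It is a truth universally acknowledged , that a single man in "
          "possession of a good fortune must be in want of a wife .").split()

for line in simple_concordance(tokens, "a"):
    print(line)
```

Each match is printed with the keyword in the same column, which is what makes it easy to scan down the list and compare the words that come before and after it.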