# Build your own concordance

It took 500 Dominican monks to write the first concordance of the Latin Bible, and it took Rabbi Mordecai Nathan 10 years to write the first concordance of the Hebrew Bible. With Python, it takes only a matter of seconds to find a word in a text, along with the words surrounding it.

Run each cell in this notebook one at a time, in order. If something in one cell doesn't work right, it might be because you have overwritten a variable, so try going back and running all the previous cells again.

First, run the code and check that everything works. Then try modifying the code. Start with the first challenge and continue in order. Feel free to work together, and see how far you can get. The important thing is to learn, not to solve all the challenges!

In [4]:
# install the natural language toolkit package (nltk), which has a copy of several texts, 
#including the King James Bible

%pip install nltk

Collecting nltk
  Downloading nltk-3.9.1-py3-none-any.whl.metadata (2.9 kB)
Collecting joblib (from nltk)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting regex>=2021.8.3 (from nltk)
  Downloading regex-2024.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Downloading nltk-3.9.1-py3-none-any.whl (1.5 MB)
Downloading regex-2024.9.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (782 kB)
Downloading joblib-1.4.2-py3-none-any.whl (301 kB)
Installing collected packages: regex, joblib, nltk
Successfully installed joblib-1.4.2 nltk-3.9.1 regex-2024.9.11
Note: you may need to restart the kernel to use updated packages.


In [5]:
# import the nltk package so that it is accessible to Python, and download a collection of texts from Project Gutenberg
import nltk
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /home/ucloud/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [3]:
# Create a variable called "bible" which contains the text of the King James bible.
bible = nltk.corpus.gutenberg.raw('bible-kjv.txt')

# make all characters lowercase
bible = bible.lower()

# remove the "\n" characters, which indicate line breaks in the text (newlines)
bible = bible.replace('\n', ' ')

# split up the text into a long list of individual words
bible = bible.split(' ')

In [4]:
# make a variable called "concordance", and fill it with every occurrence of the phrase "this world", plus a few words preceding and following it
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        if bible[i-1] == "this":
            concordance.append(str(' '.join(bible[i-5:i+5])))

In [5]:
# take a look at what the algorithm has found
concordance

['for the children of this world are in their generation',
 'them, the children of this world marry, and are given',
 'hateth his life in this world shall keep it unto',
 'shall the prince of this world be cast out. ',
 'should depart out of this world unto the father, having',
 'for the prince of this world cometh, and hath nothing',
 'because the prince of this world is judged.  16:12',
 'of the princes of this world knew: for had they',
 'for the wisdom of this world is foolishness with god.',
 'for the fashion of this world passeth away.  7:32',
 'whom the god of this world hath blinded the minds',
 'chosen the poor of this world rich in faith, and',
 'saying, the kingdoms of this world are become the kingdoms']

In [6]:
# let's see how many instances of the phrase "this world" were found
len(concordance)


13
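
One edge case in the loop above is worth knowing about. If the word you search for sits within the first five positions of the list, `i-5` becomes negative, and Python counts negative indices from the end of the list, so the slice no longer contains the word you found. The small sketch below uses a made-up list of letters (not the Bible text) to show the problem, and one possible fix using `max`:

```python
# a made-up list standing in for the list of words
words = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

i = 2  # pretend the word we searched for was found at position 2

# i-5 is -3, which Python reads as "three from the end"
naive_context = words[i-5:i+5]

# max(i-5, 0) never goes below the start of the list
safe_context = words[max(i-5, 0):i+5]

print(naive_context)  # ['f', 'g']
print(safe_context)   # ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

For the searches in this notebook the difference does not show up, because none of the matches are at the very beginning of the text.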

Let's try again, but this time let's just search for "world" by itself, not "this world".

In [7]:
concordance = []
for i, val in enumerate(bible):
    if val == "world":
        concordance.append(str(' '.join(bible[i-5:i+5])))

In [8]:
# take a look at what the algorithm has found
concordance

['and he hath set the world upon them.  2:9',
 'appeared, the foundations of the world were discovered, at the',
 'him, all the earth: the world also shall be stable,',
 'upon the face of the world in the earth. ',
 'and he shall judge the world in righteousness, he shall',
 'and the foundations of the world were discovered at thy',
 'all the ends of the world shall remember and turn',
 'all the inhabitants of the world stand in awe of',
 'not tell thee: for the world is mine, and the',
 'is thine: as for the world and the fulness thereof,',
 'he hath girded himself: the world also is stablished, that',
 'that the lord reigneth: the world also shall be established',
 'earth: he shall judge the world with righteousness, and the',
 'also he hath set the world in their heart, so',
 'and i will punish the world for their evil, and',
 'kingdoms; 14:17 that made the world as a wilderness, and',
 'fill the face of the world with cities.  14:22',
 'all the kingdoms of the world upon the face o

In [9]:
# let's see how many instances of just the word "world" were found
len(concordance)

113

Now, in the cell below, modify the code to search for a different word.

In [13]:
# add your modified code here and run the cell...
concordance = []
for i, val in enumerate(bible):
    if val == "believe":
        concordance.append(str(' '.join(bible[i-5:i+5])))

In [15]:
#see the results
concordance

['but, behold, they will not believe me, nor hearken unto',
 'hand: 4:5 that they may believe that the lord god',
 'pass, if they will not believe thee, neither hearken to',
 'first sign, that they will believe the voice of the',
 'pass, if they will not believe also these two signs,',
 'i speak with thee, and believe thee for ever. ',
 'will it be ere they believe me, for all the',
 'this thing ye did not believe the lord your god,',
 'their fathers, that did not believe in the lord their',
 'and ye inhabitants of jerusalem; believe in the lord your',
 'so shall ye be established; believe his prophets, so shall',
 'on this manner, neither yet believe him: for no god',
 'me; yet would i not believe that he had hearkened',
 'him?  39:12 wilt thou believe him, that he will',
 '26:25 when he speaketh fair, believe him not: for there',
 'that ye may know and believe me, and understand that',
 'called a multitude after thee: believe them not, though they',
 'and jesus saith unto them, belie

In [16]:
len(concordance)

103
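
So far we have copied the same loop three times and changed only the search word. One way to avoid the copying is to wrap the loop in a function. The sketch below is only a suggestion: the function name `find_in_context` and the parameter `window` are made up for this example, and the context it returns is slightly different from the cells above (it keeps `window` words on each side of the match). It works for a single word as well as for a phrase of several words.

```python
def find_in_context(words, phrase, window=5):
    """Return every occurrence of `phrase` in the list `words`,
    together with up to `window` words before and after it."""
    target = phrase.lower().split()   # "this world" -> ['this', 'world']
    n = len(target)
    hits = []
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            start = max(i - window, 0)   # do not run past the start of the list
            hits.append(' '.join(words[start:i + n + window]))
    return hits

# a tiny made-up word list so the example runs without downloading anything
sample = "the prince of this world cometh and hath nothing in me".split()

print(find_in_context(sample, "this world", window=2))
# ['prince of this world cometh and']
```

With the `bible` list from above, `find_in_context(bible, "this world")` should find the same 13 places as the first loop, only with a little more context after each match.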

The nltk package has the full text of several other classic books. You can see what they are called by running the command in the cell below:

In [17]:
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

## Your turn!

Here are some more things you can try. In each case, I have given you a little bit of starter code to get you going, but the cells will not run without some additional code from you.




## Challenge 1: build your own concordance

Pick a different book and a different word, or pair of words. Copy and paste from the code above to write some Python code that searches the book of your choice for the word or pair of words of your choice. Put this code in the cell below. By the way, some of the texts use the characters "\r" for "carriage return" instead of "\n" for "newline". You can remove these the same way that you remove the "\n" characters.
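
If you want to see what removing "\r" and "\n" does before trying it on a whole book, here is a small sketch on a made-up string. The function name `to_words` is just a suggestion and is not part of `nltk`.

```python
def to_words(raw_text):
    """Lowercase a text, replace line-break characters with spaces,
    and split it into a list of words (same steps as for the Bible above)."""
    text = raw_text.lower()
    text = text.replace('\r', ' ')
    text = text.replace('\n', ' ')
    return text.split(' ')

# a made-up snippet with a Windows-style line break ("\r\n")
print(to_words("Call me\r\nIshmael."))
# ['call', 'me', '', 'ishmael.']
```

Notice the empty string `''` in the result: "\r\n" turns into two spaces, and splitting on a single space leaves an empty "word" between them. It does no harm when searching for a word, but it will show up again when we start counting words.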

In [37]:
# add your code to search for a word or pair of words in a different book here
# we chose moby dick

moby = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

# make all characters lowercase
moby = moby.lower()

# remove the "\n" and "\r" characters, which indicate line breaks in the text (newlines and carriage returns)
moby = moby.replace('\n',' ')
moby = moby.replace('\r',' ')

# split up the text into a long list of individual words
moby = moby.split(' ')

In [38]:
#more code for moby dick
concordance_moby = []
for i, val in enumerate(moby):
    if val == "whale":
        if moby[i-1] == "this":
            concordance_moby.append(str(' '.join(moby[i-5:i+5])))

In [39]:
#the brackets? maybe footnotes?
concordance_moby

['(folio), chapter v. (razor-back).--of this whale little is  known',
 ' his face.  this whale averages some sixteen or',
 '(octavo), chapter iv. (killer).--of this whale little is  precisely',
 ' this man and this whale again came  together,',
 "all beale's drawings of this whale are good, excepting the",
 "with him, i'll eat this whale  in one mouthful.",
 "a ship's jib-boom.  this whale is  not dead;",
 'manner.  but if this whale be a king, he',
 'inclined plane!  hurrah! this whale carries the  everlasting',
 ' and as for this whale spout, you might ',
 'of the fishery,  this whale had become entangled in',
 'many hours  hence this whale will have gone two']

In [40]:
len(concordance_moby)

12

In [41]:
concordance_moby = []
for i, val in enumerate(moby):
    if val == "sea":
        concordance_moby.append(str(' '.join(moby[i-5:i+5])))

In [42]:
concordance_moby

['   "the indian sea breedeth the most and',
 'all sides, and beating the sea  before him into',
 'progress.    "that sea beast  leviathan, which',
 'whales which swim in a sea of water, and have',
 'of water, and have a sea of  oil swimming',
 '  "whales in the sea god\'s voice obey." --n.',
 'them, that when out at sea they are afraid to',
 'the first of may, the sea being then covered with',
 'high time to get to sea as soon  as',
 'the old persians hold the sea holy?  why did',
 'the habit of going to sea whenever i  begin',
 'that i ever go to sea as  a passenger.',
 'do i ever go to sea  as a commodore,',
 'again, i always go to sea as a sailor, because',
 'finally, i always go to sea as a sailor, because',
 'after having repeatedly smelt the sea  as a merchant',
 "dart you through.--it's the black sea in a  midnight",
 '  a tramping of sea boots was heard in',
 'candle on a crazy old sea  chest that did',
 'his keeping his arm at sea unmethodically  in sun',
 'and  third mates, 

In [43]:
len(concordance_moby)

149

## Challenge 2: compare lengths of books

We can use the command `len` to find how many items there are in a list. E.g., to find the number of words in the list called `bible`, above, we can write: `len(bible)`. 

Use the starter code below to find out which book in the books included in `nltk` has the most words.

In [103]:
# solution 1: print all the titles and numbers of words
# starter code:

books = nltk.corpus.gutenberg.fileids()

for title in books:
    book = nltk.corpus.gutenberg.raw(title)
    book = book.lower()
    book = book.replace('\n',' ')
    book = book.replace('\r',' ')
    book = book.split(' ')
    print(title, '=', len(book))

austen-emma.txt = 164457
austen-persuasion.txt = 86270
austen-sense.txt = 123514
bible-kjv.txt = 848001
blake-poems.txt = 8886
bryant-stories.txt = 54942
burgess-busterbrown.txt = 17976
carroll-alice.txt = 28387
chesterton-ball.txt = 86481
chesterton-brown.txt = 80382
chesterton-thursday.txt = 59297
edgeworth-parents.txt = 195982
melville-moby_dick.txt = 243947
milton-paradise.txt = 91832
shakespeare-caesar.txt = 23339
shakespeare-hamlet.txt = 33477
shakespeare-macbeth.txt = 20164
whitman-leaves.txt = 138730


In [104]:
#The bible has the most words
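
Instead of reading through the printed list by eye, you can also let Python pick out the largest number. The sketch below is one possible way: it stores a few of the counts printed above in a dictionary (typed in by hand here, so that the example runs on its own) and uses the built-in `max` function with `key=` to find the title with the highest count.

```python
# a few of the word counts printed above, typed in by hand
word_counts = {
    'austen-emma.txt': 164457,
    'bible-kjv.txt': 848001,
    'edgeworth-parents.txt': 195982,
    'melville-moby_dick.txt': 243947,
}

# max looks at the titles, but compares them by their word count
longest = max(word_counts, key=word_counts.get)

print(longest, '=', word_counts[longest])
# bible-kjv.txt = 848001
```

Inside the loop above, you could fill such a dictionary with `word_counts[title] = len(book)`.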

In [105]:
#we did this before we figured out the loop:)
emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
persuasion = nltk.corpus.gutenberg.raw('austen-persuasion.txt')
sense = nltk.corpus.gutenberg.raw('austen-sense.txt')
poems = nltk.corpus.gutenberg.raw('blake-poems.txt')
stories = nltk.corpus.gutenberg.raw('bryant-stories.txt')
busterbrown = nltk.corpus.gutenberg.raw('burgess-busterbrown.txt')
alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')
ball = nltk.corpus.gutenberg.raw('chesterton-ball.txt')
brown = nltk.corpus.gutenberg.raw('chesterton-brown.txt')
thursday = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
parents = nltk.corpus.gutenberg.raw('edgeworth-parents.txt')
moby_dick = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
paradise = nltk.corpus.gutenberg.raw('milton-paradise.txt')
caesar = nltk.corpus.gutenberg.raw('shakespeare-caesar.txt')
hamlet = nltk.corpus.gutenberg.raw('shakespeare-hamlet.txt')
macbeth = nltk.corpus.gutenberg.raw('shakespeare-macbeth.txt')
leaves = nltk.corpus.gutenberg.raw('whitman-leaves.txt')

In [174]:
# more advanced, for those with some Python experience, or those who want to google around..
# solution 2: make a list of titles and a list of wordcounts, zip them together, then sort them based on wordcount
# starter code:

books = nltk.corpus.gutenberg.fileids()

titles = ['bible', 'emma', 'persuasion', 'sense', 'poems', 'stories', 'busterbrown', 'alice', 'ball', 'brown', 'thursday', 'parents', 'moby', 'paradise', 'caesar', 'hamlet', 'macbeth', 'leaves']
# note: only "bible" and "moby" are lists of words here; the other variables still hold
# the raw text, so len() counts characters for them, not words

numwords = [len(bible), len(emma), len(persuasion), len(sense), len(poems), len(stories), len(busterbrown), len(alice), len(ball), len(brown), len(thursday), len(parents), len(moby), len(paradise), len(caesar), len(hamlet), len(macbeth), len(leaves)]
for title in books:
    book = nltk.corpus.gutenberg.raw(title)

titles_and_numword = [titles, numwords]
print(titles_and_numword)


[['bible', 'emma', 'persuasion', 'sense', 'poems', 'stories', 'busterbrown', 'alice', 'ball', 'brown', 'thursday', 'parents', 'moby', 'paradise', 'caesar', 'hamlet', 'macbeth', 'leaves'], [848001, 887071, 466292, 673022, 38153, 249439, 84663, 144395, 457450, 406629, 320525, 935158, 243947, 468220, 112310, 162881, 100351, 711215]]
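
The attempt above puts the two lists side by side in one list, but it does not pair each title with its count. Here is a sketch of what "zip them together, then sort" could look like. To keep it short and runnable on its own, it uses only four of the titles and the word counts from solution 1, typed in by hand.

```python
# a hand-typed subset of the titles and word counts from solution 1
titles = ['austen-emma.txt', 'bible-kjv.txt', 'blake-poems.txt', 'melville-moby_dick.txt']
numwords = [164457, 848001, 8886, 243947]

# zip pairs up the first title with the first count, the second with the second, ...
pairs = list(zip(titles, numwords))

# sort the pairs by their second element (the word count), largest first
ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)

for title, count in ranked:
    print(title, '=', count)
```

The first line printed is the book with the most words, here `bible-kjv.txt = 848001`.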


In [175]:
books = nltk.corpus.gutenberg.fileids()

#remember to put the list outside of the loop! Otherwise it will be overwritten within the loop

titles = []
numwords = []

for title in books:
    book = nltk.corpus.gutenberg.raw(title)
    book = book.lower()
    book = book.replace('\n',' ')
    book = book.replace('\r',' ')
    book = book.split(' ')
    titles.append(title)
    numwords.append(len(book))



In [176]:
import pandas as pd

df = pd.DataFrame(
    {'titles': titles,
     'numwords': numwords,
    })

In [177]:
df

| | titles | numwords |
|---|---|---|
| 0 | austen-emma.txt | 164457 |
| 1 | austen-persuasion.txt | 86270 |
| 2 | austen-sense.txt | 123514 |
| 3 | bible-kjv.txt | 848001 |
| 4 | blake-poems.txt | 8886 |
| 5 | bryant-stories.txt | 54942 |
| 6 | burgess-busterbrown.txt | 17976 |
| 7 | carroll-alice.txt | 28387 |
| 8 | chesterton-ball.txt | 86481 |
| 9 | chesterton-brown.txt | 80382 |


In [179]:
#sorting by word count
#ascending=False to sort them from highest to lowest number

df.sort_values("numwords", ascending=False)

| | titles | numwords |
|---|---|---|
| 3 | bible-kjv.txt | 848001 |
| 12 | melville-moby_dick.txt | 243947 |
| 11 | edgeworth-parents.txt | 195982 |
| 0 | austen-emma.txt | 164457 |
| 17 | whitman-leaves.txt | 138730 |
| 2 | austen-sense.txt | 123514 |
| 13 | milton-paradise.txt | 91832 |
| 8 | chesterton-ball.txt | 86481 |
| 1 | austen-persuasion.txt | 86270 |
| 9 | chesterton-brown.txt | 80382 |


## Challenge 3: what are the most frequent words?

`nltk` has a built-in class called `FreqDist`, which counts how many times each word occurs in a text. So, if you have a list called `words` which contains all the words in a book, you can find the frequencies of all of them by writing `freq = nltk.FreqDist(words)`. You can then get, for example, the ten most common words by writing `freq.most_common(10)`. What are the ten most common words in Jane Austen's "Emma"? What about Herman Melville's "Moby Dick"?

In [187]:
# starter code:

book = nltk.corpus.gutenberg.raw('austen-emma.txt')
words = book.lower()
words = words.split(' ')

#how do I get rid of space?

freq = nltk.FreqDist(words)
freq.most_common(10)


[('the', 4522),
 ('to', 4497),
 ('', 3704),
 ('of', 3651),
 ('and', 3449),
 ('a', 2760),
 ('i', 2151),
 ('was', 2066),
 ('not', 1819),
 ('in', 1815)]
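
About the question in the cell above ("how do I get rid of space?"): the empty string `''` shows up because `split(' ')` cuts the text at every single space, so two spaces in a row leave an empty "word" between them. One possible fix is to call `split()` with nothing inside the parentheses, which splits on any run of spaces, newlines and carriage returns and never produces empty strings. The sketch below shows the difference on a made-up sentence. It uses `Counter` from Python's standard library instead of `nltk.FreqDist` so that it runs without `nltk`; `FreqDist` is built on top of `Counter` and has the same `most_common` method.

```python
from collections import Counter

# a made-up snippet with a double space and a line break
text = "The whale  and the\r\nsea"

naive = text.lower().split(' ')   # cut at every single space
clean = text.lower().split()      # cut at any run of whitespace

print(naive)  # ['the', 'whale', '', 'and', 'the\r\nsea']
print(clean)  # ['the', 'whale', 'and', 'the', 'sea']

print(Counter(clean).most_common(1))  # [('the', 2)]
```

Note that with `split(' ')` the last "word" is `'the\r\nsea'`, because the line break was never removed. With `split()` the two words are counted separately.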

In [188]:
#What are the ten most common words in "Moby Dick"?

book = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
words = book.lower()
words = words.split(' ')
freq = nltk.FreqDist(words)
freq.most_common(10)


[('the', 11791),
 ('', 5719),
 ('of', 5709),
 ('and', 5232),
 ('to', 3978),
 ('a', 3949),
 ('in', 3533),
 ('that', 2327),
 ('his', 2061),
 ('it', 1503)]

## Challenge 4: Remove stopwords

Often, the most frequent words are not the most interesting ones. Words like "a" and "the" are so common in English that they don't really tell us much about the text. That is why we often remove "stopwords" (the most common words in a language) before, for example, counting frequencies. There are several stopword lists available, in [English](https://gist.github.com/sebleier/554280) as well as other languages, such as [Danish](https://gist.github.com/berteltorp/0cf8a0c7afea7f25ed754f24cfc2467b). Below is some starter code to remove stopwords. Use these snippets to see what the most common words in Emma and Moby Dick are after removing the stopwords.

Hint: In Moby Dick, you will also have to remove the string `\r`, in addition to removing `\n`.

In [6]:
# list of stopwords

stopwords = ["a", "", "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these", "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do", "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "s", "t", "can", "will", "just", "don", "should", "now"]

In [8]:
# starter code:
# finding the 10 most common words in Emma

book = nltk.corpus.gutenberg.raw('austen-emma.txt')
words = book.lower()
words = words.split(' ')
words = [word for word in words if word not in stopwords]
freq = nltk.FreqDist(words)
freq.most_common(10)

print(freq.most_common(10))

[('mr.', 936), ('could', 686), ('would', 679), ('mrs.', 575), ('miss', 497), ('must', 473), ('much', 375), ('every', 363), ('said', 354), ('one', 327)]


After removing stopwords, the ten most common words in Emma are: mr., could, would, mrs., miss, must, much, every, said, one.

In [9]:
# doing the same thing for Moby Dick
book = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')
words = book.lower()
words = words.split(' ')
words = [word for word in words if word not in stopwords]
freq = nltk.FreqDist(words)
freq.most_common(10)

print(freq.most_common(10))

[('one', 683), ('upon', 475), ('like', 469), ('whale', 455), ('old', 365), ('would', 349), ('though', 245), ('yet', 245), ('must', 236), ('said', 233)]


After removing stopwords, the ten most common words in Moby Dick are: one, upon, like, whale, old, would, though, yet, must, said.
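
In the results above, "mr." and "mrs." still carry their full stop, which means that "whale", "whale," and "whale." would all be counted as different words. If you want to go one step further, the sketch below shows one possible way to strip punctuation from the ends of each word before removing stopwords. Everything in it is made up for illustration: the function name `content_words`, the short stopword set, and the sample sentence. It uses `Counter` from the standard library so that it runs without `nltk`; with `nltk` you could pass the same list to `nltk.FreqDist`.

```python
import string
from collections import Counter

# a very short, made-up stopword set; a set is faster to search than a list
stopword_set = {"the", "of", "and", "a", "to", "in", "that", "his", "it"}

def content_words(raw_text, stopwords):
    """Lowercase, split on any whitespace, strip punctuation from both
    ends of each word, and drop stopwords and empty strings."""
    words = raw_text.lower().split()
    words = [word.strip(string.punctuation) for word in words]
    return [word for word in words if word and word not in stopwords]

sample = "The whale, the whale! And the sea.\r\nThat whale of his."

print(Counter(content_words(sample, stopword_set)).most_common(2))
# [('whale', 3), ('sea', 1)]
```

Whether stripping punctuation is a good idea depends on the text: for Emma it would turn "mr." into "mr", which may or may not be what you want.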