# NLP Homework 1  

Comparing Corpora with Corpus Statistics.  

For this homework, select or make two documents.  You can use books from the Gutenberg project already provided by NLTK, the corpora in the nltk.book package, you can choose large documents of your own, or you can put together groups of smaller documents to make two large documents out of the corpora.  

Try to pick two documents that are different in character in some aspect:  generally either topic, style, genre or some cultural aspect.  The work in this assignment is to run word frequencies, bigram frequencies and mutual information scores on the two documents.  Then you will select items from these lists to make a comparison between the documents to answer some question about the differences or similarities between them.  

1.  Choosing the data:  either  
a)	Choose existing large documents from NLTK or from the Gutenberg collection on the web, or  
b)	Collect your own data, by using your own documents or collecting data from other sources.  Combine the text from these   sources to make two documents for the corpora for the first task.  Describe the method that you used to define and collect the data, including the difference between the documents.  Note any limitations to the method or the text that you were able to find.  Do preprocessing to get the text in a suitable format for processing and describe what you did.  

2.  Examine the text in the documents that you chose and decide how to process the words, i.e. decide on tokenization and whether to use all lower case, stopwords or lemmatization.  Using the process developed in the lab,   
 * list the top 50 words by frequency (normalized by the length of the document)  -- `L1 Normalization`
 * list the top 50 bigrams by frequencies, and  
 * list the top 50 bigrams by their Mutual Information scores (using min frequency 5)    

Note that you may wish to modify the stop word list, based on your question in Task 3.  To complete this part:  
a)	Briefly state why you chose the processing options that you did.  
b)	Are there any problems with the word or bigram lists that you found? Could you get a better list of bigrams?   
c)	How are the top 50 bigrams by frequency different from the top 50 bigrams scored by Mutual Information?  
d)	If you modify the stop word list, or expand the methods of filtering, describe that here.  
e)	You may choose to also run top trigram lists, and include them in the analysis in part 3.  

3. Describe a problem or question that is based on the difference between the two documents.  In the case of literary works, for example, this could be how to characterize the style between two authors or two works of different classes.  Another example would be to compare the informal text in blogs with more formal text.  Or you can do a topic related comparison that selects words (as in the SOTU speeches example).  You could also make a comparison of similar text but at two different times.   

Now answer the question you have chosen by giving a discussion of the comparison of the texts.  Using one or more of the types of measures that you ran in the first task, i.e. word frequencies, bigram frequencies, or bigram mutual information, make a comparison of the two documents to answer the problem or question.  For this analysis, you will want to choose or to revise data that will be applicable for your question. You may wish to hand pick out particular examples of word frequencies, bigram frequencies or mutual information scores that contribute evidence for your comparison, or combine examples into categories.    

Make sure you include the following in your report:  
a)	Clearly describe the problem or question you are trying to address through the comparison between the two selected documents.  
b)	Present and explain insights or conclusions based on the comparison to answer the question (do not just report numbers)  


|Hierarchy | Explanation|
|:--|:--|
|Token | A single word or vocabulary item|
|Document | A string consisting of a sequence of tokens|
|Corpus | A set of documents|
|Corpora | Plural of corpus; two or more corpora|

|NLTK Methods | Syntax | Explanation|
|:--|:--|:--|
|regexp_tokenize | (raw_text, pattern) | `raw_text` is a string representing a document and `pattern` is a string representing the regex pattern you wish to apply|
|FreqDist | (tokens)| `tokens` is a list of tokens|
|most_common() | FreqDist(tokens).most_common(n) | Return a list of the `n` most common elements and their counts from the most common to the least|
|bigrams | nltk.bigrams(tokens) | Returns the adjacent token pairs; `tokens` is a list of tokens|
|apply_freq_filter | finder.apply_freq_filter(min_freq) | Removes candidate bigrams that occur fewer than `min_freq` times|
|bigram_measures |nltk.collocations.BigramAssocMeasures() | A collection of bigram association measures.|
|BigramCollocationFinder | BigramCollocationFinder.from_words(tokens) | A tool for finding and ranking bigram collocations by an association measure. It is often useful to use from_words() rather than constructing an instance directly.|
|score_ngrams | finder.score_ngrams(bigram_measures.raw_freq) | Scores every candidate bigram with the given scoring function and returns a list of (bigram, score) tuples sorted from highest to lowest score. With `raw_freq`, the score is the bigram's relative frequency.|
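
The calls in the table can be chained on a toy sentence. This is a minimal sketch: the sample text and the pattern `[a-z]+` are assumptions chosen for illustration, not part of the assignment.

```python
import nltk
from nltk import FreqDist
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# toy document (an assumption for illustration)
raw_text = "the cat sat on the mat the cat ran"

# keep runs of lowercase letters only
tokens = nltk.regexp_tokenize(raw_text, r"[a-z]+")

# unigram counts, most frequent first
fdist = FreqDist(tokens)
print(fdist.most_common(2))

# adjacent token pairs
print(list(nltk.bigrams(tokens))[:3])

# score the bigrams that occur at least twice by relative frequency
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(scored)
```

None of these calls needs a downloaded corpus, so the sketch can be run before any NLTK data is installed.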

In [1]:
import nltk
from nltk import FreqDist
nltk.corpus.gutenberg.fileids( )

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

## shakespeare-hamlet

Get the text of William Shakespeare's `Hamlet` from the Gutenberg corpus, convert it to lowercase, and tokenize it.

In [2]:
# The book filename
hamlet = nltk.corpus.gutenberg.fileids( )[15]

# Raw text of the book 
hamlettext = nltk.corpus.gutenberg.raw(hamlet)
print(hamlettext[:500])

[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not


In [3]:
# convert to lowercase, then split into word and punctuation tokens
hamlet_text_lower = hamlettext.lower()
hamlet_tokens = nltk.word_tokenize(hamlet_text_lower) 

In [4]:
print('There are {} words in the book of Hamlet from William Shakespeare'.format(len(hamlet_tokens)))

There are 36380 words in the book of Hamlet from William Shakespeare


In [5]:
# Creating a frequency distribution of words - dictionary
fdist = FreqDist(hamlet_tokens)
fdist.most_common(50)

[(',', 2892),
 ('.', 1877),
 ('the', 993),
 ('and', 862),
 ('to', 683),
 ('of', 610),
 (':', 566),
 ('i', 559),
 ('you', 527),
 ('my', 502),
 ('a', 497),
 ('?', 459),
 ('it', 419),
 ('in', 388),
 ('that', 376),
 ('is', 372),
 ('ham', 337),
 ('not', 327),
 (';', 298),
 ('his', 285),
 ('this', 275),
 ('with', 254),
 ('your', 253),
 ('but', 249),
 ('for', 243),
 ('me', 228),
 ('what', 211),
 ('lord', 211),
 ('as', 205),
 ('he', 202),
 ("'d", 200),
 ('be', 191),
 ('so', 189),
 ('him', 178),
 ('haue', 175),
 ('king', 172),
 ('will', 149),
 ('no', 137),
 ('our', 130),
 ('we', 128),
 ('on', 123),
 ('are', 121),
 ("'s", 119),
 ('if', 111),
 ('all', 109),
 ('then', 108),
 ('shall', 107),
 ('by', 105),
 ('come', 104),
 ('let', 104)]

In [6]:
# show each occurrence of the token 'ham' in context
nltk.Text(hamlet_tokens).concordance("ham")

Displaying 25 of 337 matches:
 now my cosin hamlet , and my sonne ? ham . a little more then kin , and lesse 
t that the clouds still hang on you ? ham . not so my lord , i am too much i'th
 passing through nature , to eternity ham . i madam , it is common queen . if i
why seemes it so particular with thee ham . seemes madam ? nay , it is : i know
e stay with vs , go not to wittenberg ham . i shall in all my best obey you mad
. come away . exeunt . manet hamlet . ham . oh that this too too solid flesh , 
cellus . hor . haile to your lordship ham . i am glad to see you well : horatio
my lord , and your poore seruant euer ham . sir my good friend , ile change tha
oratio ? marcellus mar . my good lord ham . i am very glad to see you : good eu
. a truant disposition , good my lord ham . i would not haue your enemy say so 
, i came to see your fathers funerall ham . i pray thee doe not mock me ( fello
ndeed my lord , it followed hard vpon ham . thrift thrift horatio : the funeral
ee my fath

#### Summary of the analysis:
* Many tokens consist of non-alphabetic characters   
* Some of the stopwords need to be removed; ex: the, a, so, etc.
* Role names appear often, and they are not useful for comparing two plays

#### Strategies:
* alpha_filter function to remove non-alphabetic tokens
* remove stopwords using the NLTK stopword list plus new stopwords such as: 'thou', 'thee', 'thy'
* remove the characters' names for the play

#### non-alphabetic tokens

To remove the non-alphabetic tokens, build a user-defined function.

In [7]:
# function that takes a word and returns True if it consists only of non-alphabetic characters
import re

# matches tokens made up entirely of non-alphabetic characters (tokens are already lowercased)
NON_ALPHA_PATTERN = re.compile(r'^[^a-z]+$')

def alpha_filter(w):
    '''
    Expects: a lowercased token, ex: '[', 'the', 'tragedie'
    Returns: True if the token contains no letters a-z at all, ex: '[' or '1599';
             otherwise False (so "'s" returns False because it contains a letter)
    '''
    return bool(NON_ALPHA_PATTERN.match(w))

In [8]:
# test alpha_filter function 
alpha_filter("'s")

False
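
The `False` result explains why fragments such as `'s` and `'d` still appear in the later frequency lists: the filter only drops tokens with no letter at all. The sketch below restates the same rule on a few sample tokens and contrasts it with a stricter letters-only rule. The stricter rule is an assumption for comparison, not what this notebook uses.

```python
import re

def alpha_filter(w):
    # same rule as the notebook: True when the token has no lowercase letter at all
    return bool(re.match(r'^[^a-z]+$', w))

samples = ['[', '1599', '&', "'s", "'d", 'hurley-burley', 'the']

# tokens that survive the notebook's filter
kept = [w for w in samples if not alpha_filter(w)]
print(kept)

# stricter alternative (an assumption): keep tokens made of letters only
strict = [w for w in samples if w.isalpha()]
print(strict)
```

The stricter rule would also discard hyphenated words such as `hurley-burley`, so which rule is better depends on the comparison question.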

In [9]:
# apply the function to hamlet_tokens

# Before
print('Before counts: {}'.format(len(hamlet_tokens)))
print('Before filtering : \n{}\n'.format(hamlet_tokens[:100]))


# After
# keep a word only when the condition is true (the word does not consist solely of non-alphabetic characters)
alph_hamlet_words = [w for w in hamlet_tokens if not alpha_filter(w)] 
print('After counts: {}'.format(len(alph_hamlet_words)))
print('After filtering : \n{}\n'.format(alph_hamlet_words[:100]))

# Summary
print('Number of words reduced: {}'.format(len(hamlet_tokens)-len(alph_hamlet_words)))

Before counts: 36380
Before filtering : 
['[', 'the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', '1599', ']', 'actus', 'primus', '.', 'scoena', 'prima', '.', 'enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', '.', 'barnardo', '.', 'who', "'s", 'there', '?', 'fran', '.', 'nay', 'answer', 'me', ':', 'stand', '&', 'vnfold', 'your', 'selfe', 'bar', '.', 'long', 'liue', 'the', 'king', 'fran', '.', 'barnardo', '?', 'bar', '.', 'he', 'fran', '.', 'you', 'come', 'most', 'carefully', 'vpon', 'your', 'houre', 'bar', '.', "'t", 'is', 'now', 'strook', 'twelue', ',', 'get', 'thee', 'to', 'bed', 'francisco', 'fran', '.', 'for', 'this', 'releefe', 'much', 'thankes', ':', "'t", 'is', 'bitter', 'cold', ',', 'and', 'i', 'am', 'sicke', 'at', 'heart', 'barn', '.', 'haue', 'you', 'had', 'quiet']

After counts: 30055
After filtering : 
['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'enter', 'barnardo', 'and', 'francisco', 

In [10]:
# new word frequency distribution
FreqDist(alph_hamlet_words).most_common(50)

[('the', 993),
 ('and', 862),
 ('to', 683),
 ('of', 610),
 ('i', 559),
 ('you', 527),
 ('my', 502),
 ('a', 497),
 ('it', 419),
 ('in', 388),
 ('that', 376),
 ('is', 372),
 ('ham', 337),
 ('not', 327),
 ('his', 285),
 ('this', 275),
 ('with', 254),
 ('your', 253),
 ('but', 249),
 ('for', 243),
 ('me', 228),
 ('what', 211),
 ('lord', 211),
 ('as', 205),
 ('he', 202),
 ("'d", 200),
 ('be', 191),
 ('so', 189),
 ('him', 178),
 ('haue', 175),
 ('king', 172),
 ('will', 149),
 ('no', 137),
 ('our', 130),
 ('we', 128),
 ('on', 123),
 ('are', 121),
 ("'s", 119),
 ('if', 111),
 ('all', 109),
 ('then', 108),
 ('shall', 107),
 ('by', 105),
 ('come', 104),
 ('let', 104),
 ('thou', 104),
 ('or', 103),
 ('do', 101),
 ('hamlet', 100),
 ('good', 98)]

#### stopwords

In [11]:
# get a list of stopwords from nltk
nltkstopwords = nltk.corpus.stopwords.words('english')
print('Number of the NLTK stopwords: {}\n'.format(len(nltkstopwords)))
print('List of the NLTK stopwords: \n{}'.format(nltkstopwords))

Number of the NLTK stopwords: 179

List of the NLTK stopwords: 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 's

The play Hamlet has 20 or more main characters:

|Character|Abbrev.| Role|
|:--|:--|:--|
|Hamlet | Ham | son of the late king and nephew of the present king, Claudius|
|Claudius | – | king of Denmark, Hamlet's uncle and brother to the former king|
|Gertrude | – | queen of Denmark and Hamlet's mother|
|Polonius | Pol/Polon | chief counsellor to the king|
|Ophelia | Ophe | Polonius's daughter|
|Horatio | Hor/Hora | friend of Hamlet|
|Laertes | Laer | Polonius's son|
|Voltimand and Cornelius | Volt | courtiers|
|Rosencrantz and Guildenstern | Rosin/Guil | courtiers, friends of Hamlet|
|Osric | Osr | a courtier|
|Marcellus | Mar | an officer|
|Barnardo | Bar | an officer|
|Francisco | Fran | a soldier|
|Reynaldo | Reynol | Polonius's servant|
|Ghost | – | the ghost of Hamlet's father|
|Fortinbras | Fortin | prince of Norway|
|Gravediggers | – | a pair of sextons|
|Player King, Player Queen, Lucianus, etc. | King/Qu/kin/Lucian | |
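
The abbreviations in this table were collected by reading the text. As a rough cross-check, speaker labels can be guessed from the layout of the raw text, where each speech starts on an indented line with a capitalized label followed by a period. The sketch below is a heuristic under that assumption; `SPEAKER_LABEL` is a hypothetical name, and the sample is the opening of the play printed earlier.

```python
import re
from collections import Counter

# opening lines of the play, as printed from the raw text
sample = (
    "  Barnardo. Who's there?\n"
    "  Fran. Nay answer me: Stand & vnfold\n"
    "your selfe\n"
    "\n"
    "   Bar. Long liue the King\n"
    "\n"
    "   Fran. Barnardo?\n"
    "  Bar. He\n"
)

# heuristic (assumption): an indented line that starts with a capitalized word and a period
SPEAKER_LABEL = re.compile(r"^[ \t]+([A-Z][a-z]+)\.", re.MULTILINE)

# count each candidate label in lowercase, to match the lowercased tokens
labels = Counter(label.lower() for label in SPEAKER_LABEL.findall(sample))
print(labels.most_common())
```

The heuristic can also pick up an ordinary one-word sentence at the start of an indented line, so the candidates still need a manual check before they go into the stopword list.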

In [12]:
# show each occurrence of the token 'heauen' in context
nltk.Text(hamlet_tokens).concordance("heauen")

Displaying 25 of 43 matches:
 his course t ' illume that part of heauen where now it burnes , marcellus and
d denmarke did sometimes march : by heauen i charge thee speake mar . it is of
 it shewes a will most incorrect to heauen , a heart vnfortified , a minde imp
t to heart ? fye , 't is a fault to heauen , a fault against the dead , a faul
 he might not beteene the windes of heauen visit her face too roughly . heauen
heauen visit her face too roughly . heauen and earth must i remember : why she
l teares . why she , euen she . ( o heauen ! a beast that wants discourse of r
; would i had met my dearest foe in heauen , ere i had euer seene that day hor
hew me the steepe and thorny way to heauen ; whilst like a puft and recklesse 
h , my lord , with all the vowes of heauen polon . i , springes to catch woodc
amn 'd , bring with thee ayres from heauen , or blasts from hell , be thy euen
tten in the state of denmarke hor . heauen will direct it mar . nay , let 's f
euer thy deare father l

In [13]:
# add extra stopwords: speaker labels for this play and contraction fragments produced by the tokenizer
morestopwords = ['laer','laertes','horatio', 'rosin','ophe','hor','ham','hamlet', 'qu', 'queene','polon','pol',
                 "'s",'sha','wo','y',"'s","'d","'ll","'t","'m","'re","'ve", "n't"]
stopwords = nltkstopwords + morestopwords
print('Number of the updated stopwords: {}\n'.format(len(stopwords)))
print('List of the updated stopwords: \n{}'.format(stopwords))

Number of the updated stopwords: 203

List of the updated stopwords: 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'som
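
Because the two lists are simply concatenated, the printed count includes repeated entries such as `'s`, which is listed twice in the extra stopwords. A set removes the repeats and makes the membership test faster. The sketch below uses small stand-in lists, which are assumptions for illustration rather than the real stopword lists.

```python
# small stand-ins for the real lists (assumptions for illustration)
base_stopwords = ['the', 'and', 'of']
extra_stopwords = ['ham', "'s", 'pol', "'s"]   # "'s" is listed twice

stopword_list = base_stopwords + extra_stopwords
stopword_set = set(stopword_list)

print(len(stopword_list))  # counts the repeated entry
print(len(stopword_set))   # unique entries only

# filtering works the same way with a set
tokens = ['the', 'king', 'and', 'ham', "'s", 'lord']
kept = [t for t in tokens if t not in stopword_set]
print(kept)
```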

In [14]:
# apply the stopwords to alph_hamlet_words

# Before
print('Before counts: {}'.format(len(alph_hamlet_words)))
print('Before filtering : \n{}\n'.format(alph_hamlet_words[:100]))


# After
# keep a word only when it is not in the stopword list
stopped_alph_hamlet_words = [w for w in alph_hamlet_words if w not in stopwords] 
print('After counts: {}'.format(len(stopped_alph_hamlet_words)))
print('After filtering : \n{}\n'.format(stopped_alph_hamlet_words[:100]))

# Summary
print('Number of words reduced: {}'.format(len(alph_hamlet_words)-len(stopped_alph_hamlet_words)))

Before counts: 30055
Before filtering : 
['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'enter', 'barnardo', 'and', 'francisco', 'two', 'centinels', 'barnardo', 'who', "'s", 'there', 'fran', 'nay', 'answer', 'me', 'stand', 'vnfold', 'your', 'selfe', 'bar', 'long', 'liue', 'the', 'king', 'fran', 'barnardo', 'bar', 'he', 'fran', 'you', 'come', 'most', 'carefully', 'vpon', 'your', 'houre', 'bar', "'t", 'is', 'now', 'strook', 'twelue', 'get', 'thee', 'to', 'bed', 'francisco', 'fran', 'for', 'this', 'releefe', 'much', 'thankes', "'t", 'is', 'bitter', 'cold', 'and', 'i', 'am', 'sicke', 'at', 'heart', 'barn', 'haue', 'you', 'had', 'quiet', 'guard', 'fran', 'not', 'a', 'mouse', 'stirring', 'barn', 'well', 'goodnight', 'if', 'you', 'do', 'meet', 'horatio', 'and', 'marcellus', 'the', 'riuals', 'of', 'my', 'watch', 'bid']

After counts: 14776
After filtering : 
['tragedie', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', '

In [15]:
# new word frequency distribution
FreqDist(stopped_alph_hamlet_words).most_common(50)

[('lord', 211),
 ('haue', 175),
 ('king', 172),
 ('shall', 107),
 ('come', 104),
 ('let', 104),
 ('thou', 104),
 ('good', 98),
 ('thy', 90),
 ('enter', 85),
 ('oh', 81),
 ('like', 77),
 ('well', 70),
 ('know', 69),
 ('would', 68),
 ('selfe', 66),
 ('may', 65),
 ('loue', 65),
 ('sir', 62),
 ('vs', 61),
 ('giue', 59),
 ('thee', 58),
 ('ile', 58),
 ('must', 58),
 ('hath', 57),
 ('speake', 55),
 ('make', 54),
 ('say', 51),
 ('doe', 51),
 ('vpon', 50),
 ('heere', 50),
 ('father', 50),
 ('go', 48),
 ('one', 46),
 ('see', 46),
 ('man', 46),
 ('time', 44),
 ('mine', 44),
 ('much', 43),
 ('heauen', 43),
 ('tell', 43),
 ('thinke', 42),
 ('thus', 41),
 ('mother', 40),
 ('play', 40),
 ('night', 38),
 ('yet', 37),
 ('death', 36),
 ('vp', 35),
 ('againe', 34)]

### list the top 50 words by frequency (normalized by the length of the document)

In [16]:
hamlet_fdist = FreqDist(stopped_alph_hamlet_words)

# Total number of tokens left after filtering (the normalization denominator)
hamlet_total_word_count = sum(hamlet_fdist.values())

print("Normalized Frequency for top 50 words:\n")
for word, count in hamlet_fdist.most_common(50):
    normalized_frequency = round((count / hamlet_total_word_count), 4)
    print("{:8s} :    {:0<.4}".format(word, normalized_frequency))

Normalized Frequency for top 50 words:

lord     :    0.0143
haue     :    0.0118
king     :    0.0116
shall    :    0.0072
come     :    0.007
let      :    0.007
thou     :    0.007
good     :    0.0066
thy      :    0.0061
enter    :    0.0058
oh       :    0.0055
like     :    0.0052
well     :    0.0047
know     :    0.0047
would    :    0.0046
selfe    :    0.0045
may      :    0.0044
loue     :    0.0044
sir      :    0.0042
vs       :    0.0041
giue     :    0.004
thee     :    0.0039
ile      :    0.0039
must     :    0.0039
hath     :    0.0039
speake   :    0.0037
make     :    0.0037
say      :    0.0035
doe      :    0.0035
vpon     :    0.0034
heere    :    0.0034
father   :    0.0034
go       :    0.0032
one      :    0.0031
see      :    0.0031
man      :    0.0031
time     :    0.003
mine     :    0.003
much     :    0.0029
heauen   :    0.0029
tell     :    0.0029
thinke   :    0.0028
thus     :    0.0028
mother   :    0.0027
play     :    0.0027
night    :    0.0026
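
Two details of this normalization are worth checking. The denominator is the number of tokens left after filtering (14776), not the raw token count of the play, and the format used above drops trailing zeros, so 0.0070 prints as 0.007. The sketch below uses a hypothetical helper `normalized_top` with a toy token list and a fixed four-decimal format.

```python
from collections import Counter

def normalized_top(tokens, n):
    """Return the n most frequent tokens with count / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(n)]

# toy token list (an assumption for illustration)
toy = ['lord', 'king', 'lord', 'come', 'lord', 'king']
for word, freq in normalized_top(toy, 3):
    # fixed four decimals keeps trailing zeros
    print("{:8s} : {:.4f}".format(word, freq))

# check against the notebook's figures: 'lord' occurs 211 times in 14776 filtered tokens
print(round(211 / 14776, 4))
```

When the two plays are compared, the same denominator choice has to be used for both documents.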


### list the top 50 bigrams by frequencies
The preprocessed tokens were used to get a list of bigrams.

In [17]:
# Bigrams and Bigram frequency distribution
hamelt_bigrams = list(nltk.bigrams(stopped_alph_hamlet_words))

# Summary
#unigram
print('unigram: \n{}\n'.format(stopped_alph_hamlet_words[:20]))
#bigram
print('bigram: \n{}\n'.format(hamelt_bigrams[:20]))

unigram: 
['tragedie', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'enter', 'barnardo', 'francisco', 'two', 'centinels', 'barnardo', 'fran', 'nay', 'answer', 'stand', 'vnfold', 'selfe', 'bar']

bigram: 
[('tragedie', 'william'), ('william', 'shakespeare'), ('shakespeare', 'actus'), ('actus', 'primus'), ('primus', 'scoena'), ('scoena', 'prima'), ('prima', 'enter'), ('enter', 'barnardo'), ('barnardo', 'francisco'), ('francisco', 'two'), ('two', 'centinels'), ('centinels', 'barnardo'), ('barnardo', 'fran'), ('fran', 'nay'), ('nay', 'answer'), ('answer', 'stand'), ('stand', 'vnfold'), ('vnfold', 'selfe'), ('selfe', 'bar'), ('bar', 'long')]



In [18]:
hamlet_bigrams_fdist = FreqDist(hamelt_bigrams)
for words, count in hamlet_bigrams_fdist.most_common(50):
    print("{:32} : {:4}".format(str(words), count))

('good', 'lord')                 :   23
('enter', 'king')                :   16
('wee', 'l')                     :   13
('haue', 'seene')                :   11
('exeunt', 'enter')              :   10
('thou', 'hast')                 :    9
('haue', 'heard')                :    9
('enter', 'polonius')            :    9
('lord', 'haue')                 :    9
('fathers', 'death')             :    8
('let', 'vs')                    :    7
('thou', 'art')                  :    7
('king', 'haue')                 :    7
('would', 'haue')                :    7
('set', 'downe')                 :    7
('good', 'friends')              :    7
('well', 'lord')                 :    7
('rosincrance', 'guildensterne') :    7
('let', 'see')                   :    7
('dost', 'thou')                 :    7
('king', 'king')                 :    7
('mine', 'owne')                 :    6
('king', 'oh')                   :    6
('ile', 'haue')                  :    6
('let', 'come')                  :    6


### list the top 50 bigrams by their Mutual Information scores (using min frequency 5) 
The preprocessed tokens were used to get a list of bigrams.

Not every pair of words in the token list is informative. NLTK provides a Pointwise Mutual Information (PMI) scoring function, which assigns each bigram a statistical association score. The collocation finder also lets you filter out token pairs that appear fewer than a minimum number of times.

PMI quantifies how much more (or less) likely two words are to co-occur than they would be if they were completely independent, given their individual probabilities.

A higher PMI means the two words appear together far more often than chance predicts, so the pair is likely to form a fixed expression or a name. PMI favors rare words, which is why a minimum frequency filter is applied.
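
As a worked example, PMI can be estimated from raw counts as $\log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$, with each probability taken as a count divided by the number of tokens. The sketch below assumes that all three probabilities share the filtered token count (14776) as the denominator, and it uses counts taken from the frequency lists in this notebook. The function `pmi` is a hypothetical helper, not an NLTK call.

```python
import math

def pmi(bigram_count, w1_count, w2_count, total_tokens):
    """log2( P(w1, w2) / (P(w1) * P(w2)) ), estimated from raw counts."""
    return math.log2(bigram_count * total_tokens / (w1_count * w2_count))

TOTAL = 14776  # filtered Hamlet tokens

# ('good', 'lord'): 23 co-occurrences; 'good' occurs 98 times, 'lord' 211 times
print(round(pmi(23, 98, 211, TOTAL), 4))

# ('enter', 'king'): 16 co-occurrences; 'enter' occurs 85 times, 'king' 172 times
print(round(pmi(16, 85, 172, TOTAL), 4))
```

Under that assumption, the two values should land close to the scores NLTK reports for these bigrams (about 4.04 and 4.02).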

In [19]:
# setup for bigrams and bigram measures
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
bigram_measures = BigramAssocMeasures()

In [20]:
# create the bigram finder and score the bigrams by PMI
finder = BigramCollocationFinder.from_words(stopped_alph_hamlet_words)
finder.apply_freq_filter(5)  # drop bigrams that occur fewer than 5 times
scored = finder.score_ngrams(bigram_measures.pmi)
for bscore in scored[:50]:
    print(bscore)

(('rosincrance', 'guildensterne'), 9.111428613152404)
(('wee', 'l'), 8.644517273515014)
(('sit', 'downe'), 8.472456527728708)
(('set', 'downe'), 7.498451736261652)
(('fathers', 'death'), 7.359115054652763)
(('dost', 'thou'), 6.498451736261652)
(('wilt', 'thou'), 6.472456527728708)
(('heauen', 'earth'), 6.354314068388943)
(('enter', 'polonius'), 6.289574121399687)
(('exeunt', 'enter'), 6.239943353675086)
(('mine', 'owne'), 6.022302722679424)
(('thou', 'art'), 5.957883354898948)
(('good', 'friends'), 5.7956857154812464)
(('haue', 'heard'), 5.762327118534818)
(('thou', 'hast'), 5.56556593212019)
(('haue', 'seene'), 5.537260562900045)
(('enter', 'ghost'), 5.371187886953338)
(('tell', 'vs'), 4.815894153604816)
(('reynol', 'lord'), 4.751357339021524)
(('shall', 'heare'), 4.524538663860136)
(('let', 'see'), 4.434321398841936)
(('king', 'polonius'), 4.424703396280339)
(('good', 'lord'), 4.038721074217058)
(('let', 'vs'), 4.027146017336062)
(('enter', 'king'), 4.015312460142638)
(('lord', 'exeu
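
To see how the two rankings differ, a few bigrams can be ranked both ways. The sketch below copies counts and rounded PMI scores for five bigrams from the outputs in this notebook; the ranks are relative to this small subset only, not to the full top-50 lists.

```python
# (count, PMI) for a few bigrams, copied from the outputs in this notebook
bigram_stats = {
    ('good', 'lord'): (23, 4.0387),
    ('enter', 'king'): (16, 4.0153),
    ('wee', 'l'): (13, 8.6445),
    ('haue', 'seene'): (11, 5.5373),
    ('rosincrance', 'guildensterne'): (7, 9.1114),
}

# rank the same bigrams by raw count and by PMI
by_count = sorted(bigram_stats, key=lambda b: bigram_stats[b][0], reverse=True)
by_pmi = sorted(bigram_stats, key=lambda b: bigram_stats[b][1], reverse=True)

for bigram in bigram_stats:
    print("{:36} count rank {}  PMI rank {}".format(
        str(bigram), by_count.index(bigram) + 1, by_pmi.index(bigram) + 1))
```

Frequent pairs built from common words, such as ('good', 'lord'), fall toward the bottom by PMI, while a name pair that almost always occurs together rises to the top.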

#### trigrams

In [21]:
# top 50 trigrams by frequency (computed on the tokens before stopword removal)
for words, count in FreqDist(nltk.trigrams(alph_hamlet_words)).most_common(50):
    print("{:32} : {:4}".format(str(words), count))

('my', 'lord', 'ham')            :   62
('my', 'lord', 'i')              :   18
('good', 'my', 'lord')           :   14
('i', 'my', 'lord')              :   13
('i', 'pray', 'you')             :   12
('lord', 'ham', 'i')             :   11
('that', 'i', 'haue')            :   11
('i', 'can', 'not')              :    9
("'t", 'is', 'a')                :    8
('my', 'good', 'lord')           :    8
('it', 'is', 'a')                :    8
('my', 'lord', 'polon')          :    7
('lord', 'i', 'haue')            :    7
('let', 'me', 'see')             :    7
('i', 'haue', 'seene')           :    6
('i', 'know', 'not')             :    6
('if', 'it', 'be')               :    6
('well', 'my', 'lord')           :    6
('rosincrance', 'and', 'guildensterne') :    6
('you', 'can', 'not')            :    5
('it', 'is', 'not')              :    5
('ham', 'i', 'am')               :    5
('i', 'do', 'not')               :    5
('ophe', 'my', 'lord')           :    5
('and', 'with', 'a')             

## shakespeare-macbeth

Get the text of William Shakespeare's `Macbeth` from the Gutenberg corpus, convert it to lowercase, and tokenize it.

In [22]:
# The book filename
macbeth = nltk.corpus.gutenberg.fileids( )[16]

# Raw text of the book 
macbethtext = nltk.corpus.gutenberg.raw(macbeth)
print(macbethtext[:500])

[The Tragedie of Macbeth by William Shakespeare 1603]


Actus Primus. Scoena Prima.

Thunder and Lightning. Enter three Witches.

  1. When shall we three meet againe?
In Thunder, Lightning, or in Raine?
  2. When the Hurley-burley's done,
When the Battaile's lost, and wonne

   3. That will be ere the set of Sunne

   1. Where the place?
  2. Vpon the Heath

   3. There to meet with Macbeth

   1. I come, Gray-Malkin

   All. Padock calls anon: faire is foule, and foule is faire,
Houer through 


In [23]:
# convert to lowercase, then split into word and punctuation tokens
macbeth_text_lower = macbethtext.lower()
macbeth_tokens = nltk.word_tokenize(macbeth_text_lower) 

In [24]:
print('There are {} words in the book of Macbeth from William Shakespeare'.format(len(macbeth_tokens)))

There are 22188 words in the book of Macbeth from William Shakespeare


In [25]:
# Creating a frequency distribution of words - dictionary
fdist = FreqDist(macbeth_tokens)
fdist.most_common(50)

[(',', 1962),
 ('.', 1174),
 ('the', 649),
 ('and', 545),
 (':', 477),
 ('to', 383),
 ('of', 338),
 ('i', 331),
 ('?', 241),
 ('a', 239),
 ('that', 236),
 ('is', 211),
 ('you', 205),
 ('my', 203),
 ('in', 200),
 ("'d", 192),
 ('not', 185),
 ('it', 161),
 ('with', 153),
 ('his', 146),
 ('be', 137),
 ('macb', 137),
 ("'s", 128),
 ('your', 126),
 ('our', 123),
 ('haue', 122),
 ('but', 120),
 ('what', 117),
 ('me', 113),
 ('he', 112),
 ('for', 109),
 ('this', 104),
 ('all', 100),
 ('so', 96),
 ('him', 90),
 ('as', 89),
 ('thou', 87),
 ('we', 83),
 ('enter', 81),
 ('which', 80),
 ("'", 74),
 ('are', 73),
 ('will', 72),
 ('they', 70),
 ('shall', 68),
 ('no', 67),
 ('then', 63),
 ('do', 63),
 ('macbeth', 62),
 ('their', 62)]
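
The raw counts of the two plays cannot be compared directly because Hamlet has 36380 tokens and Macbeth only 22188. The sketch below converts the count of 'the' in each play into a rate per 1,000 tokens, using the figures printed in this notebook. The totals include punctuation tokens, so the rates are relative to all tokens; `per_thousand` is a hypothetical helper.

```python
def per_thousand(count, total_tokens):
    """Occurrences per 1,000 tokens."""
    return 1000 * count / total_tokens

# counts of 'the' and total token counts printed for each play
hamlet_rate = per_thousand(993, 36380)
macbeth_rate = per_thousand(649, 22188)

print("hamlet : {:.2f} per 1,000 tokens".format(hamlet_rate))
print("macbeth: {:.2f} per 1,000 tokens".format(macbeth_rate))
```

Hamlet has the larger raw count, yet the rate is slightly higher in Macbeth, which is the reason the assignment asks for normalized frequencies.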

In [26]:
# check whether the Hamlet speaker label 'hor' also occurs in Macbeth
nltk.Text(macbeth_tokens).concordance("hor")

no matches


#### Summary of the analysis:
* Many tokens consist of non-alphabetic characters   
* Some of the stopwords need to be removed; ex: the, a, so, etc.
* Role names appear often, and they are not useful for comparing two plays

#### Strategies:
* alpha_filter function to remove non-alphabetic tokens
* remove stopwords using the NLTK stopword list plus new stopwords such as: 'thou', 'thee', 'thy'
* remove the characters' names for the play

#### non-alphabetic tokens

To remove the non-alphabetic tokens, build a user-defined function.

In [27]:
# function that takes a word and returns True if it consists only of non-alphabetic characters
import re

# matches tokens made up entirely of non-alphabetic characters (tokens are already lowercased)
NON_ALPHA_PATTERN = re.compile(r'^[^a-z]+$')

def alpha_filter(w):
    '''
    Expects: a lowercased token, ex: '[', 'the', 'tragedie'
    Returns: True if the token contains no letters a-z at all, ex: '[' or '1603';
             otherwise False (so "'s" returns False because it contains a letter)
    '''
    return bool(NON_ALPHA_PATTERN.match(w))

In [28]:
# test alpha_filter function 
alpha_filter("'s")

False

In [29]:
# apply the function to macbeth_tokens

# Before
print('Before counts: {}'.format(len(macbeth_tokens)))
print('Before filtering : \n{}\n'.format(macbeth_tokens[:100]))


# After
# keep a word only when the condition is true (the word does not consist solely of non-alphabetic characters)
alph_macbeth_words = [w for w in macbeth_tokens if not alpha_filter(w)] 
print('After counts: {}'.format(len(alph_macbeth_words)))
print('After filtering : \n{}\n'.format(alph_macbeth_words[:100]))

# Summary
print('Number of words reduced: {}'.format(len(macbeth_tokens)-len(alph_macbeth_words)))

Before counts: 22188
Before filtering : 
['[', 'the', 'tragedie', 'of', 'macbeth', 'by', 'william', 'shakespeare', '1603', ']', 'actus', 'primus', '.', 'scoena', 'prima', '.', 'thunder', 'and', 'lightning', '.', 'enter', 'three', 'witches', '.', '1.', 'when', 'shall', 'we', 'three', 'meet', 'againe', '?', 'in', 'thunder', ',', 'lightning', ',', 'or', 'in', 'raine', '?', '2.', 'when', 'the', 'hurley-burley', "'s", 'done', ',', 'when', 'the', 'battaile', "'s", 'lost', ',', 'and', 'wonne', '3.', 'that', 'will', 'be', 'ere', 'the', 'set', 'of', 'sunne', '1.', 'where', 'the', 'place', '?', '2.', 'vpon', 'the', 'heath', '3.', 'there', 'to', 'meet', 'with', 'macbeth', '1.', 'i', 'come', ',', 'gray-malkin', 'all', '.', 'padock', 'calls', 'anon', ':', 'faire', 'is', 'foule', ',', 'and', 'foule', 'is', 'faire', ',']

After counts: 18049
After filtering : 
['the', 'tragedie', 'of', 'macbeth', 'by', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'thunder', 'and', 'lightning', 'ent

In [30]:
# new word frequency distribution
FreqDist(alph_macbeth_words).most_common(50)

[('the', 649),
 ('and', 545),
 ('to', 383),
 ('of', 338),
 ('i', 331),
 ('a', 239),
 ('that', 236),
 ('is', 211),
 ('you', 205),
 ('my', 203),
 ('in', 200),
 ("'d", 192),
 ('not', 185),
 ('it', 161),
 ('with', 153),
 ('his', 146),
 ('be', 137),
 ('macb', 137),
 ("'s", 128),
 ('your', 126),
 ('our', 123),
 ('haue', 122),
 ('but', 120),
 ('what', 117),
 ('me', 113),
 ('he', 112),
 ('for', 109),
 ('this', 104),
 ('all', 100),
 ('so', 96),
 ('him', 90),
 ('as', 89),
 ('thou', 87),
 ('we', 83),
 ('enter', 81),
 ('which', 80),
 ('are', 73),
 ('will', 72),
 ('they', 70),
 ('shall', 68),
 ('no', 67),
 ('then', 63),
 ('do', 63),
 ('macbeth', 62),
 ('their', 62),
 ('thee', 61),
 ('vpon', 59),
 ('on', 59),
 ('macd', 58),
 ('from', 57)]

#### stopwords

In [31]:
# get a list of stopwords from nltk
nltkstopwords = nltk.corpus.stopwords.words('english')
print('Number of the NLTK stopwords: {}\n'.format(len(nltkstopwords)))
print('List of the NLTK stopwords: \n{}'.format(nltkstopwords))

Number of the NLTK stopwords: 179

List of the NLTK stopwords: 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 's

In the play Macbeth, there are more than 20 main characters:

|Character|Abbrev.|Role|
|:--|:--|:--|
|Duncan | – | king of Scotland|
|Malcolm | Malc | Duncan's elder son|
|Donalbain | – | Duncan's younger son|
|Macbeth | Macb | a general in the army of King Duncan; originally Thane of Glamis, then Thane of Cawdor, and later king of Scotland|
|Lady Macbeth | – | Macbeth's wife, and later queen of Scotland|
|Banquo | Banq | Macbeth's friend and a general in the army of King Duncan|
|Fleance | – | Banquo's son|
|Macduff | Macd | Thane of Fife|
|Lady Macduff | – | Macduff's wife|
|Ross, Lennox, Angus, Menteith, Caithness | Rosse/Lenox/Ang/Ment/Cath | Scottish Thanes|
|Siward | – | general of the English forces|
|Young Siward | – | Siward's son|
|Seyton | – | Macbeth's armourer|
|Hecate | – | queen of the witches|
|Three Witches | – | |
|Captain | Cap | in the Scottish army|
|Three Murderers | – | employed by Macbeth|
|Two Murderers | – | attack Lady Macduff|
|Porter | – | gatekeeper at Macbeth's home|
|Doctor | – | Lady Macbeth's doctor|
|Doctor | – | at the English court|
|Gentlewoman | – | Lady Macbeth's caretaker|
|Lord | – | opposed to Macbeth|
|First Apparition | – | armed head|
|Second Apparition | – | bloody child|
|Third Apparition | – | crowned child|
|Attendants, Messengers, Servants, Soldiers | – ||

In [32]:
# check the sentence that contains the word 
nltk.Text(macbeth_tokens).concordance("thane")

Displaying 25 of 25 matches:
 . who comes here ? mal . the worthy thane of rosse lenox . what a haste lookes
g king . whence cam'st thou , worthy thane ? rosse . from fiffe , great king , 
by that most disloyall traytor , the thane of cawdor , began a dismall conflict
our generall vse king . no more that thane of cawdor shall deceiue our bosome i
1. all haile macbeth , haile to thee thane of glamis 2. all haile macbeth , hai
2. all haile macbeth , haile to thee thane of cawdor 3. all haile macbeth , tha
ore : by sinells death , i know i am thane of glamis , but how , of cawdor ? th
f glamis , but how , of cawdor ? the thane of cawdor liues a prosperous gentlem
 banq . you shall be king macb . and thane of cawdor too : went it not so ? ban
r , he bad me , from him , call thee thane of cawdor : in which addition , hail
n which addition , haile most worthy thane , for it is thine banq . what , can 
 the deuill speake true ? macb . the thane of cawdor liues : why doe you dresse
n borrowed 

In [33]:
## add some extra stopwords
morestopwords = ['macb', 'macbeth', 'macd', 'banquo', 'lenox', 'mal','banq','rosse',
                 "'s",'sha','wo','y',"'s","'d","'ll","'t","'m","'re","'ve", "n't"]
stopwords = nltkstopwords + morestopwords
print('Number of the updated stopwords: {}\n'.format(len(stopwords)))
print('List of the updated stopwords: \n{}'.format(stopwords))

Number of the updated stopwords: 199

List of the updated stopwords: 
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'som

In [34]:
# apply the stopwords to alph_macbeth_words

# Before
print('Before counts: {}'.format(len(alph_macbeth_words)))
print('Before filtering : \n{}\n'.format(alph_macbeth_words[:100]))


# After
# keep only the words that are not in the stopword list
stopped_alph_macbeth_words = [w for w in alph_macbeth_words if w not in stopwords] 
print('After counts: {}'.format(len(stopped_alph_macbeth_words)))
print('After filtering : \n{}\n'.format(stopped_alph_macbeth_words[:100]))

# Summary
print('Number of words reduced: {}'.format(len(alph_macbeth_words)-len(stopped_alph_macbeth_words)))

Before counts: 18049
Before filtering : 
['the', 'tragedie', 'of', 'macbeth', 'by', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'thunder', 'and', 'lightning', 'enter', 'three', 'witches', 'when', 'shall', 'we', 'three', 'meet', 'againe', 'in', 'thunder', 'lightning', 'or', 'in', 'raine', 'when', 'the', 'hurley-burley', "'s", 'done', 'when', 'the', 'battaile', "'s", 'lost', 'and', 'wonne', 'that', 'will', 'be', 'ere', 'the', 'set', 'of', 'sunne', 'where', 'the', 'place', 'vpon', 'the', 'heath', 'there', 'to', 'meet', 'with', 'macbeth', 'i', 'come', 'gray-malkin', 'all', 'padock', 'calls', 'anon', 'faire', 'is', 'foule', 'and', 'foule', 'is', 'faire', 'houer', 'through', 'the', 'fogge', 'and', 'filthie', 'ayre', 'exeunt', 'scena', 'secunda', 'alarum', 'within', 'enter', 'king', 'malcome', 'donalbaine', 'lenox', 'with', 'attendants', 'meeting', 'a', 'bleeding', 'captaine', 'king', 'what', 'bloody']

After counts: 9489
After filtering : 
['tragedie', 'william', 'shakesp

In [35]:
# frequency distribution after stopword removal
FreqDist(stopped_alph_macbeth_words).most_common(50)

[('haue', 122),
 ('thou', 87),
 ('enter', 81),
 ('shall', 68),
 ('thee', 61),
 ('vpon', 59),
 ('yet', 57),
 ('thy', 56),
 ('vs', 56),
 ('come', 54),
 ('king', 53),
 ('hath', 52),
 ('good', 48),
 ('lady', 48),
 ('would', 47),
 ('time', 46),
 ('let', 42),
 ('like', 40),
 ('say', 39),
 ('make', 39),
 ('doe', 38),
 ('lord', 38),
 ('must', 36),
 ('done', 35),
 ('ile', 35),
 ('feare', 35),
 ('wife', 34),
 ('man', 33),
 ('well', 33),
 ('know', 33),
 ('selfe', 32),
 ('one', 32),
 ('great', 31),
 ('see', 31),
 ('may', 31),
 ('exeunt', 30),
 ('speake', 29),
 ('night', 29),
 ('sir', 29),
 ('mine', 26),
 ('vp', 26),
 ('th', 26),
 ('heere', 26),
 ('thane', 25),
 ('giue', 24),
 ('looke', 23),
 ('things', 23),
 ('sleepe', 23),
 ('hand', 23),
 ('blood', 23)]

### list the top 50 words by frequency (normalized by the length of the document)

In [36]:
macbeth_fdist = FreqDist(stopped_alph_macbeth_words)

# Total length of the document (number of tokens left after filtering)
macbeth_total_word_count = sum(macbeth_fdist.values())

print("Normalized Frequency for top 50 words:\n")
for word, count in macbeth_fdist.most_common(50):
    normalized_frequency = round((count / macbeth_total_word_count), 4)
    print("{:8s} :    {:0<.4}".format(word, normalized_frequency))

Normalized Frequency for top 50 words:

haue     :    0.0129
thou     :    0.0092
enter    :    0.0085
shall    :    0.0072
thee     :    0.0064
vpon     :    0.0062
yet      :    0.006
thy      :    0.0059
vs       :    0.0059
come     :    0.0057
king     :    0.0056
hath     :    0.0055
good     :    0.0051
lady     :    0.0051
would    :    0.005
time     :    0.0048
let      :    0.0044
like     :    0.0042
say      :    0.0041
make     :    0.0041
doe      :    0.004
lord     :    0.004
must     :    0.0038
done     :    0.0037
ile      :    0.0037
feare    :    0.0037
wife     :    0.0036
man      :    0.0035
well     :    0.0035
know     :    0.0035
selfe    :    0.0034
one      :    0.0034
great    :    0.0033
see      :    0.0033
may      :    0.0033
exeunt   :    0.0032
speake   :    0.0031
night    :    0.0031
sir      :    0.0031
mine     :    0.0027
vp       :    0.0027
th       :    0.0027
heere    :    0.0027
thane    :    0.0026
giue     :    0.0025
looke    :    0.0024
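
Because the same normalization is applied to each document being compared, it may help to wrap it in a function that takes the token list as its only input, so the total always comes from the same list as the counts. The following is a minimal sketch using `collections.Counter` (which `FreqDist` subclasses). The name `normalized_top` and the toy token list are hypothetical and not part of the lab code.

```python
from collections import Counter

def normalized_top(tokens, n=50):
    """Return the n most frequent tokens with counts divided by len(tokens)."""
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    return [(word, count / total) for word, count in counts.most_common(n)]

# Toy data for illustration only; in the notebook this would be a filtered token list.
toy_tokens = ['thane', 'cawdor', 'thane', 'king', 'thane', 'king', 'night', 'thane']
for word, freq in normalized_top(toy_tokens, n=3):
    print('{:8s} : {:.4f}'.format(word, freq))
```

Calling the same function on each document's token list would give relative frequencies that are directly comparable even when the documents differ in length.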

### list the top 50 bigrams by frequencies
The preprocessed tokens were used to get a list of bigrams.

In [37]:
# Bigrams and Bigram frequency distribution
macbeth_bigrams = list(nltk.bigrams(stopped_alph_macbeth_words))

# Summary
#unigram
print('unigram: \n{}\n'.format(stopped_alph_macbeth_words[:20]))
#bigram
print('bigram: \n{}\n'.format(macbeth_bigrams[:20]))

unigram: 
['tragedie', 'william', 'shakespeare', 'actus', 'primus', 'scoena', 'prima', 'thunder', 'lightning', 'enter', 'three', 'witches', 'shall', 'three', 'meet', 'againe', 'thunder', 'lightning', 'raine', 'hurley-burley']

bigram: 
[('tragedie', 'william'), ('william', 'shakespeare'), ('shakespeare', 'actus'), ('actus', 'primus'), ('primus', 'scoena'), ('scoena', 'prima'), ('prima', 'thunder'), ('thunder', 'lightning'), ('lightning', 'enter'), ('enter', 'three'), ('three', 'witches'), ('witches', 'shall'), ('shall', 'three'), ('three', 'meet'), ('meet', 'againe'), ('againe', 'thunder'), ('thunder', 'lightning'), ('lightning', 'raine'), ('raine', 'hurley-burley'), ('hurley-burley', 'done')]
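
As a rough illustration of what `nltk.bigrams` produces, pairing each token with its successor can be sketched in plain Python with `zip`. This is an equivalent toy version for understanding the output, not the NLTK implementation; the short token list is copied from the unigram output above.

```python
# First few filtered tokens, taken from the unigram output above.
tokens = ['tragedie', 'william', 'shakespeare', 'actus', 'primus']

# Pair every token with the one that follows it.
pairs = list(zip(tokens, tokens[1:]))
print(pairs)
```

Since the pairs are built from the filtered list, two words can form a bigram even when punctuation or stopwords separated them in the original text. For example, `('witches', 'shall')` comes from `witches . 1. when shall`.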



In [38]:
macbeth_bigrams_fdist = FreqDist(macbeth_bigrams)
for words, count in macbeth_bigrams_fdist.most_common(50):
    print("{:32} : {:4}".format(str(words), count))

('exeunt', 'scena')              :   15
('thane', 'cawdor')              :   13
('knock', 'knock')               :   10
('thou', 'art')                  :    9
('haue', 'done')                 :    8
('enter', 'lady')                :    8
('good', 'lord')                 :    8
('let', 'vs')                    :    7
('wee', 'l')                     :    7
('enter', 'three')               :    5
('three', 'witches')             :    5
('scena', 'secunda')             :    5
('enter', 'king')                :    5
('worthy', 'thane')              :    5
('thy', 'selfe')                 :    5
('euery', 'one')                 :    5
('would', 'haue')                :    5
('mine', 'eyes')                 :    5
('make', 'vs')                   :    5
('enter', 'malcolme')            :    5
('mine', 'owne')                 :    5
('ten', 'thousand')              :    4
('shew', 'shew')                 :    4
('haue', 'seene')                :    4
('come', 'come')                 :    4


### list the top 50 bigrams by their Mutual Information scores (using min frequency 5) 
The preprocessed tokens were used to get a list of bigrams.

Not every pair of adjacent words in the token list is informative. NLTK provides a Pointwise Mutual Information (PMI) measure that assigns each bigram an association score, and the collocation finder can filter out pairs that appear fewer than a minimum number of times.

PMI quantifies how much more or less likely two words are to co-occur than they would be if they were completely independent, given their individual probabilities.

A higher PMI means that the two words appear together more often than chance would predict, which suggests that the pair expresses a single concept.
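
To make the definition concrete, the score can be computed by hand as log2 of the observed bigram probability divided by the product of the two word probabilities. The sketch below assumes probabilities are estimated as simple counts over the 9489 filtered tokens. The function name `pmi` is hypothetical, and the counts are taken from the frequency lists in this notebook (`haue` 122, `done` 35, bigram `('haue', 'done')` 8).

```python
import math

def pmi(bigram_count, w1_count, w2_count, total_words):
    """Pointwise mutual information (base 2) estimated from raw counts."""
    joint = bigram_count / total_words                       # P(w1, w2)
    independent = (w1_count / total_words) * (w2_count / total_words)  # P(w1) * P(w2)
    return math.log2(joint / independent)

# Counts read off the outputs above; total is the token count after filtering.
score = pmi(bigram_count=8, w1_count=122, w2_count=35, total_words=9489)
print(round(score, 3))
```

Under these assumptions the result agrees with the PMI score of about 4.152 that NLTK reports for `('haue', 'done')`. It also shows why pairs of rare words, such as `('three', 'witches')`, score higher than pairs built from very frequent words.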

In [39]:
# setup for bigrams and bigram measures
from nltk.collocations import BigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()

In [40]:
# create the bigram finder and score the bigrams by PMI
finder = BigramCollocationFinder.from_words(stopped_alph_macbeth_words)
finder.apply_freq_filter(5)  # drop bigrams that occur fewer than 5 times
scored = finder.score_ngrams(bigram_measures.pmi)
top_scored = scored[:50]  # at most 50; fewer if the filter leaves fewer bigrams
for bscore in top_scored:
    print(bscore)
print(len(top_scored))

(('wee', 'l'), 9.238035549771494)
(('three', 'witches'), 8.83352871798482)
(('scena', 'secunda'), 8.752608722601252)
(('knock', 'knock'), 8.533968436125914)
(('thane', 'cawdor'), 7.876306446826158)
(('exeunt', 'scena'), 7.752608722601252)
(('mine', 'eyes'), 7.374097099347523)
(('worthy', 'thane'), 6.890112246351187)
(('mine', 'owne'), 6.74606587673448)
(('euery', 'one'), 6.533968436125912)
(('thou', 'art'), 5.76909684538982)
(('enter', 'malcolme'), 5.493678715100195)
(('enter', 'three'), 5.493678715100195)
(('good', 'lord'), 5.379150327073807)
(('let', 'vs'), 4.819722918459789)
(('thy', 'selfe'), 4.726613514068308)
(('make', 'vs'), 4.44121129520606)
(('enter', 'lady'), 4.287227837632768)
(('haue', 'done'), 4.152019986730695)
(('enter', 'king'), 3.466197978678087)
(('would', 'haue'), 3.0486422468853878)
21


#### trigrams

In [41]:
# trigram frequencies, computed on the tokens before stopword removal
for words, count in FreqDist(nltk.trigrams(alph_macbeth_words)).most_common(50):
    print("{:32} : {:4}".format(str(words), count))

('thane', 'of', 'cawdor')        :   13
('the', 'thane', 'of')           :    8
('my', 'good', 'lord')           :    8
('i', 'pray', 'you')             :    7
('can', 'not', 'be')             :    6
('who', "'s", 'there')           :    6
('knock', 'knock', 'knock')      :    6
('i', 'can', 'not')              :    5
('enter', 'macbeth', 'macb')     :    5
('what', "'s", 'the')            :    5
('i', 'my', 'good')              :    5
('exeunt', 'scena', 'secunda')   :    4
('this', 'is', 'the')            :    4
('all', 'haile', 'macbeth')      :    4
('there', "'s", 'no')            :    4
('it', 'is', 'a')                :    4
('he', 'ha', "'s")               :    4
('i', 'would', 'not')            :    4
('i', 'see', 'thee')             :    4
('i', 'haue', 'done')            :    4
('good', 'lord', 'macb')         :    4
('i', 'will', 'not')             :    4
('my', 'lord', 'macb')           :    4
('she', 'ha', "'s")              :    4
('borne', 'of', 'woman')         :    4
