## Bigrams

In the text: “two great and powerful groups of nations”, the
bigrams are “two great”, “great and”, “and powerful”, etc.

The **frequency of an n-gram** is the proportion of all
the n-grams in the corpus that are that n-gram, which
can be a useful corpus statistic.
- For bigram xy: Count of bigram xy / Count of all bigrams in corpus
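
The definition above can be tried on the example phrase. This is a minimal sketch in plain Python, assuming simple whitespace tokenization rather than the NLTK tokenizer used in the cells that follow; the variable names are chosen for illustration only.

```python
from collections import Counter

text = "two great and powerful groups of nations"
tokens = text.split()

# Pair each token with its successor to form the bigrams.
bigrams = list(zip(tokens, tokens[1:]))

# Relative frequency: count of each bigram / count of all bigrams.
counts = Counter(bigrams)
total = sum(counts.values())
rel_freq = {bg: n / total for bg, n in counts.items()}

print(bigrams[:3])
print(rel_freq[("two", "great")])
```

Seven tokens give six bigrams, and each occurs once, so every bigram in this phrase has the same relative frequency.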




In [None]:
# Getting started: imports for processing a text example
import nltk
from nltk import FreqDist
import numpy as np
import pandas as pd

In [10]:
# get the text of the book Emma from the Gutenberg corpus, tokenize it,
#   and reduce the tokens to lowercase.
print(nltk.corpus.gutenberg.fileids(), '\n')
file0 = nltk.corpus.gutenberg.fileids()[0]
emmatext = nltk.corpus.gutenberg.raw(file0)
emmatokens = nltk.word_tokenize(emmatext)
emmawords = [w.lower() for w in emmatokens]
# show the number of words and print the first 110 words
print(len(emmawords), '\n')

print(emmawords[:110])

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt'] 

191673 

['[', 'emma', 'by', 'jane', 'austen', '1816', ']', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.', 'she', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate'

In [14]:
#create a frequency distribution of the words
ndist = FreqDist(emmawords)
nitems = ndist.most_common(30)
for item in nitems:
    print (item[0], '\t',item[1])

, 	 12016
. 	 6355
the 	 5198
to 	 5179
and 	 4875
of 	 4284
i 	 3164
a 	 3124
-- 	 3100
it 	 2500
'' 	 2452
her 	 2448
was 	 2396
; 	 2353
she 	 2336
not 	 2279
in 	 2173
be 	 1970
you 	 1962
he 	 1806
that 	 1804
`` 	 1735
had 	 1623
but 	 1439
as 	 1436
for 	 1346
have 	 1319
is 	 1241
with 	 1215
very 	 1202


In [42]:
#Different version of the words using "words" function
emmawords2 = nltk.corpus.gutenberg.words('austen-emma.txt')
emmawords2lowercase = [w.lower() for w in emmawords2]


#Do you think the word lists that emmawords and emmawords2lowercase contain should be
#identical? Try this out and see if this is what you expected: they are different
print(len(emmawords),' ',len(emmawords2lowercase))

#Now try printing some of the words. What difference do you see?
tmp=np.array(emmawords[:50])
tmp2=np.array(emmawords2lowercase[:50])
d = {'emmawords': tmp, 'emmawords2': tmp2}
display(pd.DataFrame(data=d))
#emmawords2 separates twenty-one into 'twenty', '-' and 'one'

191673   192427


| | emmawords | emmawords2 |
|---|---|---|
| 0 | [ | [ |
| 1 | emma | emma |
| 2 | by | by |
| 3 | jane | jane |
| 4 | austen | austen |
| 5 | 1816 | 1816 |
| 6 | ] | ] |
| 7 | volume | volume |
| 8 | i | i |
| 9 | chapter | chapter |


In [72]:
#remove all the tokens that consist only of special characters
#this regular expression matches any token that contains no lower-case letters a-z
# (the tokens were lowercased above, so such tokens have no alphabetic characters)

import re
#any string that contains a lower-case letter will not be matched
pattern = re.compile('^[^a-z]+$')

def alpha_filter(w):
    # return True if w is made up entirely of characters other than a-z
    return bool(pattern.match(w))

print(alpha_filter("aZa"))
print(alpha_filter("ZZ"))


#Apply the new function to emmawords to include only those that don’t match the filter:
alphaemmawords = [w for w in emmawords if not alpha_filter(w)]
print(len(alphaemmawords))
print(alphaemmawords[:100])



False
True
161456
['emma', 'by', 'jane', 'austen', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', 'and', 'had', 'lived', 'nearly', 'twenty-one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', 'she', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', 'indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', 'sister', "'s", 'marriage', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', 'her', 'mother', 'had', 'died', 'too', 'long', 'ago', 'for', 'her', 'to', 'have', 'more', 'than', 'an', 'indistinct', 'remembrance', 'of', 'her', 'caresses']


In [89]:
#remove some of the common words that appear 
#with great frequency.
stopwords = nltk.corpus.stopwords.words('english')

stoppedemmawords=[]
for word in alphaemmawords:
    if word not in stopwords:
        stoppedemmawords.append(word)
#stoppedemmawords = [w for w in alphaemmawords if not w in stopwords]
print(len(stoppedemmawords))
        
#Now we can remake our frequency distribution with our new filtered word list.
emmadist = FreqDist(stoppedemmawords)
emmaitems = emmadist.most_common(30)
for item in emmaitems:
    print(item)


74093
[('mr.', 1089), ("'s", 866), ('emma', 855), ('could', 836), ('would', 818), ('mrs.', 668), ('miss', 597), ('must', 566), ('harriet', 496), ('much', 484), ('said', 483), ('one', 447), ('every', 434), ('weston', 430), ('thing', 394), ('think', 383), ('elton', 378), ('well', 375), ('knightley', 373), ('little', 359), ('never', 358), ('know', 335), ('might', 325), ('good', 313), ('say', 310), ('woodhouse', 308), ('jane', 299), ('quite', 282), ('time', 275), ('great', 263)]
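
The stopword loop above tests membership in a list, which scans the list for every word. A minimal sketch of the same filtering with a set is shown here; the tiny stopword list and the sample words are hypothetical stand-ins for `nltk.corpus.stopwords.words('english')` and `alphaemmawords`.

```python
# Hypothetical, tiny stopword list for illustration only; the notebook
# uses nltk.corpus.stopwords.words('english') instead.
stopword_list = ['the', 'of', 'and', 'a', 'to', 'in', 'was', 'her']
stopword_set = set(stopword_list)  # set lookup avoids scanning a list

# Hypothetical sample of lowercased, alphabetic tokens.
words = ['emma', 'woodhouse', 'was', 'the', 'youngest',
         'of', 'the', 'two', 'daughters']

# Keep only the words that are not stopwords.
stopped = [w for w in words if w not in stopword_set]
print(stopped)
```

The result is the same as with a list; only the lookup cost per word changes, which can matter on a corpus of this size.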


[]
Used to indicate a set of characters. In a set:

- Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
- Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digits numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal '-'.
- Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
- Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether LOCALE or UNICODE mode is in force.
- Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, [()[\]{}] and []()[{}] will both match a parenthesis.
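
A few of the character-set rules above can be checked directly with the standard `re` module. This is a small sketch; the sample strings are made up for illustration, and the last check reuses the pattern from `alpha_filter`.

```python
import re

# Characters listed individually: matches 'a', 'm', or 'k'.
listed = re.findall(r'[amk]', 'make')

# Ranges: a two-digit number from 00 to 59.
in_range = bool(re.fullmatch(r'[0-5][0-9]', '42'))
out_of_range = bool(re.fullmatch(r'[0-5][0-9]', '61'))

# Special characters are literal inside a set.
literals = re.findall(r'[(+*)]', '(a+b)*c')

# Complemented set: remove every character except '5'.
only_fives = re.sub(r'[^5]', '', 'a5b55')

# The alpha_filter pattern: one or more characters, none of them a-z.
no_letters = re.compile('^[^a-z]+$')
matches_punct = bool(no_letters.match('--'))
matches_word = bool(no_letters.match("twenty-one"))

print(listed, in_range, out_of_range, literals, only_fives,
      matches_punct, matches_word)
```

The last two results show why `'^[^a-z]+$'` separates punctuation-only tokens from words: a single lower-case letter anywhere in the token prevents the match.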

In [104]:
#Bigram Frequency Distributions
#Another way to look for interesting characterizations of a corpus is to look at pairs of
#words that are frequently collocated, that is, they occur in a sequence called a bigram.

#look at the bigrams that can be defined.
emmabigrams = list(nltk.bigrams(emmawords))
print(emmabigrams[:20])

#To start using bigrams, we import the collocation finder module.
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()

#you must use the entire list of emmawords before any filtering, or the raw
#bigrams will not be correct. Start with all the words and then run the filters in the bigram
#finder.

#The finder then allows us to call other functions to filter the bigrams that it collected and to give scores to the
#bigrams.

finder = BigramCollocationFinder.from_words(emmawords)
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(type(scored))
first = scored[0]

# for bscore in scored[:30]:
#     print (bscore)
    
#Apply alpha_filter on the finder
finder.apply_word_filter(alpha_filter) #drop bigrams that contain a non-alphabetic token
scored = finder.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:30]:
    print (bscore)

print('\n',"filtered out the stopwords:\n")

#filter out the stopwords
finder.apply_word_filter(lambda w: w in stopwords)
scored = finder.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:30]:
    print (bscore)




[('[', 'emma'), ('emma', 'by'), ('by', 'jane'), ('jane', 'austen'), ('austen', '1816'), ('1816', ']'), (']', 'volume'), ('volume', 'i'), ('i', 'chapter'), ('chapter', 'i'), ('i', 'emma'), ('emma', 'woodhouse'), ('woodhouse', ','), (',', 'handsome'), ('handsome', ','), (',', 'clever'), ('clever', ','), (',', 'and'), ('and', 'rich'), ('rich', ',')]
<class 'list'>
(('to', 'be'), 0.0031512002212100818)
(('of', 'the'), 0.002916425370292112)
(('in', 'the'), 0.0023216624146332556)
(('it', 'was'), 0.002306010757905391)
(('i', 'am'), 0.0020451498124409804)
(('she', 'had'), 0.0017321166778836872)
(('she', 'was'), 0.0017112478022465346)
(('had', 'been'), 0.001601686205151482)
(('it', 'is'), 0.0015442967971493115)
(('i', 'have'), 0.0014660385135099885)
(('could', 'not'), 0.0014503868567821237)
(('mr.', 'knightley'), 0.0014138663244171062)
(('of', 'her'), 0.0013564769164149358)
(('mrs.', 'weston'), 0.0012834358516849009)
(('have', 'been'), 0.0012573497571384598)
(('he', 'had'), 0.001252132538229171
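
The advice in the cell above, to build the finder from all the words and filter afterwards, can be illustrated without NLTK. The sketch below uses a made-up token list and a hypothetical helper `is_punct` that behaves roughly like `alpha_filter`; it is not NLTK's implementation.

```python
# Made-up tokens for illustration.
tokens = ['my', 'lord', '.', 'ham', '.', 'i', 'am', 'glad']

def is_punct(w):
    # True when the token has no lower-case letter (roughly like alpha_filter).
    return not any('a' <= c <= 'z' for c in w)

# Approach A: filter the tokens first, then form bigrams.
prefiltered = [w for w in tokens if not is_punct(w)]
bigrams_a = list(zip(prefiltered, prefiltered[1:]))

# Approach B: form bigrams from all tokens, then drop any bigram
# that contains a filtered token (what apply_word_filter does).
all_bigrams = list(zip(tokens, tokens[1:]))
bigrams_b = [(w1, w2) for w1, w2 in all_bigrams
             if not (is_punct(w1) or is_punct(w2))]

print(bigrams_a)
print(bigrams_b)
```

Approach A produces pairs such as `('lord', 'ham')` that never occurred next to each other in the text, while approach B keeps only pairs that were truly adjacent.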

In [113]:
#another filter removes bigrams that occur with a
#frequency below some minimum threshold

finder2 = BigramCollocationFinder.from_words(emmawords)
finder2.apply_freq_filter(2)
#Removes candidate ngrams which have frequency less than min_freq.

scored = finder2.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:20]:
    print (bscore)


((',', 'and'), 0.00981880598728042)
(('.', "''"), 0.0060363222780464645)
(("''", '``'), 0.005003312934007398)
((';', 'and'), 0.004523328794352882)
(('to', 'be'), 0.0031512002212100818)
((',', "''"), 0.0030468558430243172)
(('.', 'i'), 0.0029738147782942823)
((',', 'i'), 0.0029685975593849944)
(('of', 'the'), 0.002916425370292112)
(('in', 'the'), 0.0023216624146332556)
(('it', 'was'), 0.002306010757905391)
((';', 'but'), 0.0022277524742660678)
(('.', '``'), 0.0021703630662638974)
(('.', 'she'), 0.002154711409536033)
(('i', 'am'), 0.0020451498124409804)
((',', 'that'), 0.0018781988073437574)
(('!', '--'), 0.001794723304795146)
(('--', 'and'), 0.0017425511157022637)
(('she', 'had'), 0.0017321166778836872)
(('she', 'was'), 0.0017112478022465346)


In [116]:
#filter out the bigrams whose first word is shorter than 2 characters
finder2.apply_ngram_filter(lambda w1, w2: len(w1) < 2)
scored = finder2.score_ngrams(bigram_measures.raw_freq)
for bscore in scored[:20]:
    print (bscore)

(("''", '``'), 0.005003312934007398)
(('to', 'be'), 0.0031512002212100818)
(('of', 'the'), 0.002916425370292112)
(('in', 'the'), 0.0023216624146332556)
(('it', 'was'), 0.002306010757905391)
(('--', 'and'), 0.0017425511157022637)
(('she', 'had'), 0.0017321166778836872)
(('she', 'was'), 0.0017112478022465346)
(('had', 'been'), 0.001601686205151482)
(('it', 'is'), 0.0015442967971493115)
(('could', 'not'), 0.0014503868567821237)
(('mr.', 'knightley'), 0.0014138663244171062)
(("''", 'said'), 0.0013825630109613768)
(('``', 'i'), 0.0013616941353242241)
(('of', 'her'), 0.0013564769164149358)
(('--', 'i'), 0.0013408252596870712)
(('mrs.', 'weston'), 0.0012834358516849009)
(('have', 'been'), 0.0012573497571384598)
(('he', 'had'), 0.0012521325382291715)
(('to', 'the'), 0.001236480881501307)
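
The two filters used above can be mimicked in plain Python to make their effect explicit. This is a rough analogue on a made-up token list, not NLTK's code. It assumes the score is the bigram count divided by the number of tokens and that filtering leaves this denominator unchanged, which is consistent with `('to', 'be')` keeping the same score before and after filtering in the outputs above.

```python
from collections import Counter

# Made-up tokens for illustration.
tokens = ['.', 'she', 'was', 'happy', '.', 'she', 'was', 'kind',
          '.', 'i', 'was', 'glad']
N = len(tokens)
counts = Counter(zip(tokens, tokens[1:]))

# Analogue of apply_freq_filter(2): keep bigrams seen at least twice.
kept = {bg: n for bg, n in counts.items() if n >= 2}

# Analogue of apply_ngram_filter(lambda w1, w2: len(w1) < 2):
# drop bigrams whose first word is shorter than 2 characters.
kept = {bg: n for bg, n in kept.items() if not len(bg[0]) < 2}

# Score the surviving bigrams; the denominator N is not recomputed.
scored = sorted(((bg, n / N) for bg, n in kept.items()),
                key=lambda item: (-item[1], item[0]))
print(scored)
```

Here `('.', 'she')` passes the frequency filter but is removed by the length test on its first word, leaving only `('she', 'was')`.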


## Mutual Information and other scorers
N-gram probabilities predict the next word; Mutual
Information instead measures how strongly two words
that occur in sequence are associated.
- Given a pair of words, it compares the probability that the two occur together as a joint event with the probability that they occur independently and their co-occurrence is simply the result of chance
- The more strongly connected two items are, the higher their MI value will be

Based on the work of Church & Hanks (1990), who generalized MI from
information theory to apply to words in sequence (they used the term Association Ratio):
- P(x) and P(y) are estimated by the number of
observations of x and y in a corpus, normalized by N,
the size of the corpus
- P(x,y) is estimated by the number of times that x is followed by y in a
window of w words, also normalized by N
- Mutual Information score (also sometimes called PMI,
Pointwise Mutual Information):

$$\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
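
The PMI formula can be worked through on a toy example. The sketch below is plain Python on a made-up sentence, assuming adjacent pairs only (no window) and probabilities estimated as counts divided by the number of tokens N; the function name `pmi` is chosen for illustration and is not the NLTK scorer.

```python
import math
from collections import Counter

# Made-up tokens for illustration.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
N = len(tokens)
word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def pmi(x, y):
    # Estimate each probability as a count normalized by N.
    p_xy = bigram_counts[(x, y)] / N
    p_x = word_counts[x] / N
    p_y = word_counts[y] / N
    return math.log2(p_xy / (p_x * p_y))

print(pmi('sat', 'on'))   # both words occur only together
print(pmi('on', 'the'))   # 'the' also occurs in other contexts
```

Both bigrams occur twice, but `('sat', 'on')` scores higher because 'the' also appears elsewhere. Under this estimate a pair of words that each occur exactly once scores log2(N), which for N = 191673 is about 17.548, the value shared by the top-ranked Emma bigrams in the unfiltered PMI output.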



In [137]:
#In NLTK, the mutual information score is given by a function for Pointwise Mutual Information,
#where this is the version without the window.
finder3 = BigramCollocationFinder.from_words(emmawords)
scored = finder3.score_ngrams(bigram_measures.pmi)
for bscore in scored[:30]:
    print (bscore)

print ('\n')
#It is recommended to run the PMI scorer with a minimum frequency of 5, which will make more sense on very large documents.
finder3.apply_freq_filter(5)
scored = finder3.score_ngrams(bigram_measures.pmi)
for bscore in scored[:30]:
    print (bscore)


(('26th', 'ult.'), 17.54828760064729)
(('_______', 'regiment'), 17.54828760064729)
(('_a_', '_source_'), 17.54828760064729)
(('_amor_', '_patriae_'), 17.54828760064729)
(('_and_', '_misery_'), 17.54828760064729)
(('_any_', '_thing_'), 17.54828760064729)
(('_be_', '_a_'), 17.54828760064729)
(('_caro_', '_sposo_'), 17.54828760064729)
(('_dissolved_', '_it_.'), 17.54828760064729)
(('_great_', '_way_'), 17.54828760064729)
(('_most_', '_precious_'), 17.54828760064729)
(('_precious_', '_treasures_'), 17.54828760064729)
(('_repentance_', '_and_'), 17.54828760064729)
(('_rev._', '_philip_'), 17.54828760064729)
(('_robin_', '_adair_'), 17.54828760064729)
(('_small_', 'half-glass'), 17.54828760064729)
(('_with_', '_time_'), 17.54828760064729)
(('`our', 'lot'), 17.54828760064729)
(('adequate', 'restoratives'), 17.54828760064729)
(('austen', '1816'), 17.54828760064729)
(('baronne', "d'almane"), 17.54828760064729)
(('base', 'aspersion'), 17.54828760064729)
(('bulky', 'forms'), 17.54828760064729)
((

In [143]:
'''
Exercise for Week 2:

For this exercise,
- Choose a file that you want to work on, 
either one of the files from the book corpus, 
or one from the Gutenberg corpus.
'''

#Import required modules
import nltk
from nltk import FreqDist
from nltk.corpus import brown

#check which files are available in the Gutenberg corpus
nltk.corpus.gutenberg.fileids()

#I will pick 'shakespeare-hamlet.txt'
file0 = nltk.corpus.gutenberg.fileids()[-3] 
#file0 = 'shakespeare-hamlet.txt'

#1. get the text with nltk.corpus.gutenberg.raw()
hamlettext=nltk.corpus.gutenberg.raw(file0)

#2. Get the tokens with nltk.word_tokenize()
hamlettokens = nltk.word_tokenize(hamlettext)

#3. Get the words by using w.lower() to lowercase the tokens
hamletwords = [w.lower() for w in hamlettokens]

#4. make a bigram finder
#without a filter
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(hamletwords)
scored = finder.score_ngrams(bigram_measures.raw_freq)
display(pd.DataFrame(scored[:20],columns=['bigrams','freq']))


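#with alpha filter and stopwords filter
#note: the alpha filter is applied here before the finder is built, so words that were
#separated by punctuation can become adjacent and be counted as bigrams
#(this may be why ('lord', 'ham') ranks first in the results)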
alphahamletwords = [w for w in hamletwords if not alpha_filter(w)]
finder = BigramCollocationFinder.from_words(alphahamletwords)
finder.apply_word_filter(lambda w: w in stopwords)
scored = finder.score_ngrams(bigram_measures.raw_freq)
display(pd.DataFrame(scored[:20],columns=['bigrams','freq']))


#PMI score with min freq of 5
finder = BigramCollocationFinder.from_words(hamletwords)
finder.apply_freq_filter(5)
scored = finder.score_ngrams(bigram_measures.pmi)
display(pd.DataFrame(scored[:20],columns=['bigrams','PMI Score']))



'''
- Make a bigram finder and experiment with whether to 
apply the filters or not. 
Run the scoring with both the raw frequency 
and the pmi scorers and compare results.

To complete the exercise, choose one of your top 20 frequency lists to report to show to the
class. Write an introductory sentence or paragraph telling what text you chose and what
bigram filters and scorer you used. Put this and the frequency list in a discussion posting in
the blackboard system under the Discussions tab.
'''

'''
For the exercise, I chose Hamlet from the Gutenberg corpus
Here are the frequency lists I experimented with:
For bigram frequency, I used:
1. Without a filter
2. Filtered out the non-alphabetic tokens and the stopwords
For PMI score, I filtered out the bigrams with frequencies lower than 5
Here are the results

'''

| | bigrams | freq |
|---|---|---|
| 0 | (,, and) | 0.012801 |
| 1 | (ham, .) | 0.009277 |
| 2 | (my, lord) | 0.004817 |
| 3 | (., i) | 0.004157 |
| 4 | (,, that) | 0.003744 |
| 5 | (,, i) | 0.002808 |
| 6 | (king, .) | 0.002643 |
| 7 | (hor, .) | 0.002615 |
| 8 | (,, the) | 0.00256 |
| 9 | (,, to) | 0.002175 |


| | bigrams | freq |
|---|---|---|
| 0 | (lord, ham) | 0.002231 |
| 1 | (hamlet, ham) | 0.0005 |
| 2 | (enter, king) | 0.000466 |
| 3 | (ha, 's) | 0.000466 |
| 4 | (enter, hamlet) | 0.000333 |
| 5 | (exeunt, enter) | 0.000333 |
| 6 | (ham, oh) | 0.000333 |
| 7 | (haue, seene) | 0.000333 |
| 8 | (let, 's) | 0.000333 |
| 9 | (lord, hamlet) | 0.000333 |


| | bigrams | PMI Score |
|---|---|---|
| 0 | (th, ') | 9.217978 |
| 1 | (o, 're) | 9.20891 |
| 2 | (i'th, ') | 9.143977 |
| 3 | (any, thing) | 8.798218 |
| 4 | (fathers, death) | 8.656862 |
| 5 | (set, downe) | 8.310772 |
| 6 | (our, selues) | 8.126347 |
| 7 | (o, n't) | 7.801252 |
| 8 | (dost, thou) | 7.796198 |
| 9 | (wilt, thou) | 7.770203 |

