## NLP lecture 4, comprehension exercises for bigrams and collocations



In class, we saw that [collocations][1] are a useful construct for thinking about word senses, but we did not look in detail at how collocations can be found or what they look like.  You can find [general discussions of English collocations][2], but it is more instructive to code up a collocation finder and inspect the results.  This notebook fills that gap.

[1]:https://en.wikipedia.org/wiki/Collocation
[2]:https://en.wikipedia.org/wiki/English_collocations

This notebook uses the [NLTK collocation tools, which are explained in this how-to page][3].

[3]:http://www.nltk.org/howto/collocations.html

In [1]:
from __future__ import print_function
import os
import nltk
import operator

An easy way to get started is to pick a text that interests you and store its word tokens, in order of occurrence, as a list in the variable `words`.  As always, NLTK comes with [a variety of text corpora that are easy to access][1], and you can pick one of those.  You can also take a text from [Project Gutenberg][2] or any other source you like.

[1]:http://www.nltk.org/book/ch02.html
[2]:http://www.gutenberg.org/
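
If you bring your own text, for example a plain-text file from Project Gutenberg, it has to be split into tokens before it can be stored in `words`. Here is a minimal sketch, assuming a regular expression that separates runs of word characters from runs of punctuation. The sample string and the names `raw_text` and `TOKEN_PATTERN` are hypothetical. To my knowledge NLTK's `wordpunct_tokenize` behaves in a similar way, but treat that as an assumption and check it against your own data.

```python
import re

# Hypothetical sample; in practice this would be the contents of your text file
raw_text = 'He said, "Don\'t go!"'

# Either a run of word characters or a run of non-space punctuation
TOKEN_PATTERN = r"\w+|[^\w\s]+"

tokens = re.findall(TOKEN_PATTERN, raw_text)
print(tokens)
```

This style of tokenization splits contractions into pieces such as `Don`, `'`, `t` and keeps adjacent punctuation together as one token such as `!"`, which is worth remembering when you read the bigram lists later in this notebook.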

In [2]:
words = nltk.corpus.gutenberg.words('chesterton-brown.txt')
len(words)

86063

Once you have your data, the code below considers only bigrams that occur at least 10 times in your corpus and finds the top 50 bigram collocations, ranked by [pointwise mutual information][1].

[1]:https://en.wikipedia.org/wiki/Pointwise_mutual_information
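
To make the score concrete, pointwise mutual information compares how often two words occur together with how often they would co-occur if they were independent: PMI(w1, w2) = log2( P(w1, w2) / (P(w1) P(w2)) ). The sketch below estimates these probabilities from raw counts with the standard library only. It is an illustration under stated assumptions, not NLTK's implementation: the function name `pmi_scores`, the toy sentence, and the choice to divide both unigram and bigram counts by the total number of tokens are all mine.

```python
from collections import Counter
from math import log2


def pmi_scores(tokens, min_freq=1):
    """Return {bigram: PMI} for adjacent pairs seen at least min_freq times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)  # assumption: same denominator for words and pairs
    scores = {}
    for (w1, w2), pair_count in bigram_counts.items():
        if pair_count < min_freq:
            continue  # frequency filter: skip rare pairs
        # log2( (pair/total) / ((c1/total) * (c2/total)) ), simplified
        scores[(w1, w2)] = log2(pair_count * total / (word_counts[w1] * word_counts[w2]))
    return scores


# Hypothetical toy corpus
toy_tokens = "Father Brown sat down and Father Brown sat up and then sat down".split()
scores = pmi_scores(toy_tokens, min_freq=2)
for bigram, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(bigram, round(score, 2))
```

Rare words inflate PMI because the denominator becomes small, which is one reason for applying a frequency filter before ranking.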



In [3]:
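if words:
    # Association measures (PMI among others) used to score bigrams
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    # Counts of individual tokens and of adjacent token pairs
    word_fd = nltk.FreqDist(words)
    bigram_fd = nltk.FreqDist(nltk.bigrams(words))
    finder = nltk.collocations.BigramCollocationFinder(word_fd, bigram_fd)
    # Ignore bigrams that occur fewer than 10 times
    finder.apply_freq_filter(10)
    # Top 50 bigrams by PMI; inside an `if` block the notebook does not echo
    # this value, so the call is repeated at top level in a later cell
    finder.nbest(bigram_measures.pmi, 50)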

In [10]:
# (token, count) pairs, most frequent first
sorted_words = sorted(nltk.FreqDist(words).items(), key=operator.itemgetter(1), reverse=True)
sorted_words[:50]

[('the', 4321),
 (',', 4069),
 ('.', 2784),
 ('of', 2087),
 ('and', 2074),
 ('a', 2074),
 ('"', 1461),
 ('to', 1378),
 ('in', 1205),
 ('was', 1141),
 ('I', 1093),
 ('he', 1047),
 ("'", 924),
 ('his', 907),
 ('that', 880),
 ('it', 796),
 (';', 764),
 ('with', 726),
 ('you', 586),
 ('as', 585),
 ('-', 573),
 ('had', 521),
 (',"', 520),
 ('but', 487),
 ('on', 456),
 ('is', 450),
 ('."', 444),
 ('s', 438),
 ('at', 416),
 ('said', 415),
 ('for', 387),
 ('him', 376),
 ('The', 346),
 ('like', 328),
 ('not', 313),
 ('He', 310),
 ('be', 305),
 ('man', 303),
 ('t', 296),
 ('or', 287),
 ('have', 283),
 ('one', 264),
 ('Brown', 261),
 ('by', 259),
 ('this', 254),
 ('all', 253),
 ('?"', 248),
 ('which', 240),
 ('were', 240),
 ('an', 236)]

In [9]:
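# Caution: iterating a FreqDist yields the bigram tuples themselves, so itemgetter(1) sorts by second word (reverse alphabetical), not by count
sorted(nltk.FreqDist(nltk.bigrams(words)), key=operator.itemgetter(1), reverse=True)[:50]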

[('and', 'youthful'),
 ('This', 'youth'),
 ('his', 'youth'),
 ('in', 'youth'),
 ('early', 'youth'),
 ('the', 'youth'),
 ('dissipated', 'youth'),
 ('hurt', 'yourselves'),
 ('hang', 'yourselves'),
 ('violent', 'yourself'),
 ('about', 'yourself'),
 ('call', 'yourself'),
 ('Explain', 'yourself'),
 ('offer', 'yourself'),
 ('was', 'yourself'),
 ('of', 'yourself'),
 ('all', 'yourself'),
 ('you', 'yourself'),
 ('proved', 'yourself'),
 ('of', 'yours'),
 ('Not', 'yours'),
 ('me', 'your'),
 ('whom', 'your'),
 ('and', 'your'),
 ('with', 'your'),
 ('putting', 'your'),
 ('compose', 'your'),
 ('till', 'your'),
 ('is', 'your'),
 ('of', 'your'),
 ('never', 'your'),
 ('not', 'your'),
 ('have', 'your'),
 ('Is', 'your'),
 ('as', 'your'),
 ('"', 'your'),
 ('all', 'your'),
 ('be', 'your'),
 ('to', 'your'),
 ('know', 'your'),
 ('did', 'your'),
 ('observe', 'your'),
 ('from', 'your'),
 ('force', 'your'),
 ('in', 'your'),
 ('at', 'your'),
 ('see', 'your'),
 ('keep', 'your'),
 ('off', 'your'),
 ('entreat', 'you
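
To rank bigrams by how often they occur, the sort has to run over (bigram, count) pairs rather than over the bigrams alone. Here is a minimal sketch with `collections.Counter` and a hypothetical toy sentence. In NLTK 3, `FreqDist` is built on `Counter`, so I would expect `.items()` and `.most_common()` to work the same way there, but that is an assumption to verify against your installed version.

```python
import operator
from collections import Counter

# Hypothetical toy corpus
toy_tokens = "the cat sat on the mat and the cat ran".split()
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# Sort (bigram, count) pairs by the count, highest first
by_count = sorted(bigram_counts.items(), key=operator.itemgetter(1), reverse=True)[:3]

# Shortcut that ranks by count directly
top_bigrams = bigram_counts.most_common(3)

print(top_bigrams)
```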

In [8]:
finder = nltk.collocations.BigramCollocationFinder(word_fd, bigram_fd)
finder.apply_freq_filter(10)
finder.nbest(bigram_measures.pmi, 50)

[('Wilson', 'Seymour'),
 ('Sir', 'Claude'),
 ('Sir', 'Wilson'),
 ('Captain', 'Cutler'),
 ('Dr', 'Hood'),
 ('Mrs', 'Boulnois'),
 ('Mr', 'Glass'),
 ('Mr', 'Todhunter'),
 ('Father', 'Brown'),
 ('sat', 'down'),
 ('!"', 'cried'),
 ('?"', 'asked'),
 ('my', 'brother'),
 (',"', 'replied'),
 ('-', 'haired'),
 ('dressing', '-'),
 ('at', 'least'),
 ('other', 'end'),
 (',"', 'answered'),
 ('an', 'instant'),
 ('Do', 'you'),
 ('has', 'been'),
 ('What', 'do'),
 ('two', 'men'),
 (':', '`'),
 ('once', 'more'),
 ('did', 'not'),
 ('cried', 'Flambeau'),
 ('little', 'priest'),
 ('might', 'have'),
 ("'", 's'),
 ("'", 'd'),
 ("'", 'm'),
 ('wouldn', "'"),
 ("'", 've'),
 ('didn', "'"),
 ('don', "'"),
 ('won', "'"),
 ("'", 'll'),
 ('couldn', "'"),
 ('doesn', "'"),
 ('isn', "'"),
 ("'", 't'),
 ('must', 'be'),
 ('Don', "'"),
 ('came', 'out'),
 ("'", 're'),
 ('after', 'all'),
 ('so', 'much'),
 (',"', 'said')]

Comment on the collocations that you find in your list.  In particular, some of the collocations are likely to be:

- Two-word names (give your examples)
- Specialized concepts that happen to be written with two words (give your examples)
- Fixed expressions that are basically idiomatic (give your examples)
- "True collocations" where you see pairs of words frequently used together in particular senses (give your examples)

Are there other patterns that you find?  Which of these are likely to be important for text categorization?  How about for doing disambiguation?  How about for recognizing the meaning of a text?

For this analysis I used the document `chesterton-brown.txt` from the Gutenberg corpus that ships with NLTK. The bigram collocation finder gives its highest PMI scores to personal names and to titles attached to names. Examples are  
**[('Wilson', 'Seymour'),('Sir', 'Claude'),('Sir', 'Wilson')].** 

A very interesting pattern is closing punctuation followed by a verb of speech. Examples are 
**[ ('!"', 'cried'), ('?"', 'asked'), (',"', 'said')].**
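
These two observations can be turned into a rough first-pass sort of the result list. The sketch below is only a heuristic under my own assumptions: the function name `rough_category` and the three labels are hypothetical, and the sample is a handful of bigrams copied from the PMI output above. It will misfile sentence-initial capitalized words and contraction fragments, so its output still needs to be read by hand.

```python
def rough_category(bigram):
    """Assign a coarse label to a bigram using surface features only."""
    w1, w2 = bigram
    # Any token that is not purely alphabetic counts as punctuation here
    if not (w1.isalpha() and w2.isalpha()):
        return "punctuation"
    # Two capitalized words in a row look like a name or a title plus a name
    if w1[0].isupper() and w2[0].isupper():
        return "name-like"
    return "other"


# A few bigrams taken from the PMI output in this notebook
sample = [
    ('Wilson', 'Seymour'), ('Sir', 'Claude'),
    ('!"', 'cried'), ('?"', 'asked'), (',"', 'said'),
    ('sat', 'down'), ("'", 't'),
]

groups = {}
for bigram in sample:
    groups.setdefault(rough_category(bigram), []).append(bigram)

for label, members in groups.items():
    print(label, members)
```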

