## Collocations and N-grams

NLTK book: [Collocations and Bigrams](https://www.nltk.org/book/ch01#collocations-and-bigrams)

In [None]:
import nltk

# make sure that NLTK language resources have been downloaded 
# (see "NLTK Introduction" notebook)

from nltk.book import *

**Bigrams** are pairs of adjacent words in the text; every consecutive pair counts, however rare it is.

**Collocations** are sequences (e.g. pairs) of words that occur together unusually often.

In [None]:
help(nltk.bigrams)

In [None]:
t1_bigrams = nltk.bigrams(text1[:10])

# nltk.bigrams returns a generator; convert it to a list to see the pairs
list(t1_bigrams)
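
What `nltk.bigrams` produces can be pictured as a window sliding over the token list. The following is a minimal plain-Python sketch of that idea, not NLTK's own implementation; the function name `sliding_ngrams` and the sample tokens are made up for illustration.

```python
def sliding_ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple (hypothetical helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


tokens = ["Call", "me", "Ishmael", "."]

# n=2 gives bigrams, n=3 gives trigrams
pairs = list(sliding_ngrams(tokens, 2))
triples = list(sliding_ngrams(tokens, 3))
print(pairs)
print(triples)
```

Like `nltk.bigrams`, the sketch is lazy: nothing is computed until the generator is consumed, which is why it is wrapped in `list()`.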

To find **collocations**, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.

Additional information:
* [nltk.text.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* [NLTK documentation: collocations](https://www.nltk.org/api/nltk.html#module-nltk.collocations)
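
To make "more often than we would expect" concrete, here is a small plain-Python sketch that scores each bigram with pointwise mutual information (PMI): the log of the observed bigram probability divided by the probability expected if the two words were independent. This is an illustrative assumption, not what `collocation_list()` does (it ranks by likelihood ratio); the function name `pmi_scores` and the toy sentence are invented for the example.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information (sketch)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = n_words - 1

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        p_pair = count / n_bigrams              # observed probability of the pair
        p_w1 = unigram_counts[w1] / n_words     # probability of each word alone
        p_w2 = unigram_counts[w2] / n_words
        # expected probability under independence is p_w1 * p_w2
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "new york is big and new york is old and the sea is big".split()
scores = pmi_scores(tokens)

# list pairs from highest to lowest PMI
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

On this toy input the fixed phrase `new york` scores above `is big`, because `is` also appears in other contexts. A pair whose words each occur only once, such as `the sea` here, scores highest of all, which is one reason a frequency filter is usually applied before ranking.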

In [None]:
coll_list = text1.collocation_list()

coll_list

In [None]:
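# we can also look for trigram collocations
# (see http://www.nltk.org/howto/collocations.html)

coll_3 = nltk.collocations.TrigramCollocationFinder.from_words(text1)

trigram_measures = nltk.collocations.TrigramAssocMeasures()

# every trigram paired with its relative frequency (not used further below)
scored = coll_3.score_ngrams(trigram_measures.raw_freq)

# take the 15 most frequent trigrams; note that sorted() then orders them
# alphabetically, not by frequency
sorted(coll_3.nbest(trigram_measures.raw_freq, 15))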

Note: raw frequency only surfaces the most common word sequences. To find unusual trigrams we would need to filter the results and pick a more appropriate association measure, as `Text.collocation_list()` does for bigrams.

Source code for `collocation_list()`: https://www.nltk.org/_modules/nltk/text.html#Text.collocation_list

```
            # print("Building collocations list")
            from nltk.corpus import stopwords

            ignored_words = stopwords.words("english")
            
            finder = BigramCollocationFinder.from_words(self.tokens, window_size)
            finder.apply_freq_filter(2)
            finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
            
            bigram_measures = BigramAssocMeasures()
            self._collocations = finder.nbest(bigram_measures.likelihood_ratio, num)
```
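
The two filters in the excerpt above (drop n-grams seen fewer than 2 times, and drop n-grams containing a short word or a stopword) can be mimicked for trigrams in plain Python. This is only a sketch of the filtering steps: it ranks the survivors by raw count rather than by likelihood ratio, and `STOPWORDS`, `is_ignored`, and `filtered_trigrams` are hypothetical names with a deliberately tiny stopword set.

```python
from collections import Counter

# Hypothetical, tiny stopword set for illustration only;
# NLTK's English stopword list is much longer.
STOPWORDS = {"the", "of", "and", "in", "a", "to"}


def is_ignored(word):
    """Same test as the word filter in the excerpt: short word or stopword."""
    return len(word) < 3 or word.lower() in STOPWORDS


def filtered_trigrams(tokens, min_freq=2):
    """Count trigrams, then apply a frequency filter and a word filter."""
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    kept = {
        gram: n
        for gram, n in counts.items()
        # keep trigrams seen at least min_freq times with no ignored word
        if n >= min_freq and not any(is_ignored(w) for w in gram)
    }
    # most frequent first, ties broken alphabetically
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


tokens = ("the great white whale swam and the great white whale dived "
          "in the sea and in the sea").split()
result = filtered_trigrams(tokens)
print(result)
```

In this toy text three trigrams occur twice, but two of them contain a stopword, so only one survives both filters.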