## Collocations and N-grams

NLTK book: [Collocations and Bigrams](https://www.nltk.org/book/ch01#collocations-and-bigrams)

In [1]:
import nltk

# make sure that NLTK language resources have been downloaded 
# (see "NLTK Introduction" notebook)

from nltk.book import text1, text2, text3, text4, text5, text6, text7, text8, text9

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


**Bigrams** are pairs of consecutive words in the text.

**Trigrams** are triplets of consecutive words in the text.

**N-grams** are sequences of N consecutive words.

**Collocations** are sequences (e.g. pairs) of words that occur together unusually often.
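
The definitions above can be illustrated with a small plain-Python sketch. The helper name `simple_ngrams` is made up for this example; NLTK ships its own `nltk.ngrams`, which returns an iterator rather than a list.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n consecutive tokens as a list of tuples.

    Illustrative helper, not part of NLTK.
    """
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["Call", "me", "Ishmael", "."]
print(simple_ngrams(tokens, 2))  # bigrams
print(simple_ngrams(tokens, 3))  # trigrams
```

A list of 4 tokens yields 3 bigrams and 2 trigrams; a list shorter than `n` yields an empty list.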

### Bigrams vs Collocations

The difference between bigrams and collocations is that bigrams are any two words that occur consecutively in the text, whereas collocations are only those pairs of words that occur more often than we would expect based on the frequency of the individual words.
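
To make "more often than we would expect" concrete, here is a minimal sketch that compares the observed count of a word pair with the count expected if the two words were independent. The function name and the toy sentence are assumptions for illustration only; real collocation measures (such as those in `nltk.collocations`) are more refined.

```python
from collections import Counter


def observed_and_expected(tokens, w1, w2):
    """Return (observed, expected) counts of the bigram (w1, w2).

    The expected count assumes the two words occur independently.
    Illustrative helper, not part of NLTK; tokens must be non-empty.
    """
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1
    observed = bigram_counts[(w1, w2)]
    # P(w1) * P(w2), scaled by the number of bigram positions
    expected = n_bigrams * (word_counts[w1] / n_tokens) * (word_counts[w2] / n_tokens)
    return observed, expected


# made-up toy text
tokens = "the white whale and the white sail and the old whale".split()
obs, exp = observed_and_expected(tokens, "the", "white")
print(f"observed={obs}, expected={exp:.2f}")
```

In this toy text the pair occurs twice, roughly four times its expected count of about 0.5, which is the kind of surplus that marks a collocation candidate.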

In [2]:
help(nltk.bigrams)

Help on function bigrams in module nltk.util:

bigrams(sequence, **kwargs)
    Return the bigrams generated from a sequence of items, as an iterator.
    For example:
    
        >>> from nltk.util import bigrams
        >>> list(bigrams([1,2,3,4,5]))
        [(1, 2), (2, 3), (3, 4), (4, 5)]
    
    Use bigrams for a list version of this function.
    
    :param sequence: the source data to be converted into bigrams
    :type sequence: sequence or iter
    :rtype: iter(tuple)



In [2]:
t1_bigrams = nltk.bigrams(text1[:10])

# to print bigrams, convert it to a list of tuples
list(t1_bigrams)

[('[', 'Moby'),
 ('Moby', 'Dick'),
 ('Dick', 'by'),
 ('by', 'Herman'),
 ('Herman', 'Melville'),
 ('Melville', '1851'),
 ('1851', ']'),
 (']', 'ETYMOLOGY'),
 ('ETYMOLOGY', '.')]

In [4]:
# we could also write our own function to do this
def bigrams(str_list):
    # zip walks through two sequences at the same time:
    # here the list itself and the same list shifted by one position
    # (zip stops at the shorter one, so no indexing is needed)
    return list(zip(str_list, str_list[1:]))

# let's try it out
bigrams(text1[:10])

[('[', 'Moby'),
 ('Moby', 'Dick'),
 ('Dick', 'by'),
 ('by', 'Herman'),
 ('Herman', 'Melville'),
 ('Melville', '1851'),
 ('1851', ']'),
 (']', 'ETYMOLOGY'),
 ('ETYMOLOGY', '.')]

To find **collocations**, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words.

Additional information:
* [nltk.text.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* [NLTK documentation: collocations](https://www.nltk.org/api/nltk.html#module-nltk.collocations)

In [5]:
coll_list = text1.collocation_list()
# this list of tuples holds the bigrams that are most characteristic of Moby Dick:
# these word pairs would occur together far less often if the text were randomly shuffled
coll_list

[('Sperm', 'Whale'),
 ('Moby', 'Dick'),
 ('White', 'Whale'),
 ('old', 'man'),
 ('Captain', 'Ahab'),
 ('sperm', 'whale'),
 ('Right', 'Whale'),
 ('Captain', 'Peleg'),
 ('New', 'Bedford'),
 ('Cape', 'Horn'),
 ('cried', 'Ahab'),
 ('years', 'ago'),
 ('lower', 'jaw'),
 ('never', 'mind'),
 ('Father', 'Mapple'),
 ('cried', 'Stubb'),
 ('chief', 'mate'),
 ('white', 'whale'),
 ('ivory', 'leg'),
 ('one', 'hand')]

## Collocations vs idioms

Idiomatic expressions are a subset of collocations: they are collocations whose meaning is not predictable from the meanings of the individual words.

In [6]:
# we can also look for trigram collocations
# collocations that are 3 words in length

# http://www.nltk.org/howto/collocations.html

coll_3 = nltk.collocations.TrigramCollocationFinder.from_words(text1)

trigram_measures = nltk.collocations.TrigramAssocMeasures()
scored = coll_3.score_ngrams(trigram_measures.raw_freq)

sorted(coll_3.nbest(trigram_measures.raw_freq, 15))

[("'", 's', 'a'),
 ("'", 's', 'the'),
 (',', 'and', 'the'),
 (',', 'as', 'if'),
 (',', 'in', 'the'),
 (',', 'then', ','),
 ('.', 'It', 'was'),
 ('.', 'Now', ','),
 ('Ahab', "'", 's'),
 ('don', "'", 't'),
 ('he', "'", 's'),
 ('of', 'the', 'whale'),
 ('ship', "'", 's'),
 ('the', 'Sperm', 'Whale'),
 ('whale', "'", 's')]

Note: to find unusual trigrams we would need to filter the results and pick a more appropriate association measure than raw frequency, as is done in `Text.collocation_list()`.

Source code for `collocation_list()`: https://www.nltk.org/_modules/nltk/text.html#Text.collocation_list

```
            # print("Building collocations list")
            from nltk.corpus import stopwords

            ignored_words = stopwords.words("english")
            
            finder = BigramCollocationFinder.from_words(self.tokens, window_size)
            finder.apply_freq_filter(2)
            finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
            
            bigram_measures = BigramAssocMeasures()
            self._collocations = finder.nbest(bigram_measures.likelihood_ratio, num)
```
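
The same filtering idea carries over to trigrams. Below is a minimal plain-Python sketch, not NLTK's `TrigramCollocationFinder`: it drops rare trigrams and trigrams containing short words or stop words, then ranks the rest by pointwise mutual information (PMI). The choice of PMI, the function name `filtered_trigrams`, and its parameters are assumptions made for illustration.

```python
import math
from collections import Counter


def filtered_trigrams(tokens, stop_words, min_freq=2, top_n=10):
    """Rank trigrams by PMI after frequency and word filtering.

    Illustrative helper, not part of NLTK.
    """
    word_counts = Counter(tokens)
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n = len(tokens)

    def keep(trigram, count):
        # frequency filter, then word filter (short words and stop words)
        if count < min_freq:
            return False
        return all(len(w) >= 3 and w.lower() not in stop_words for w in trigram)

    def pmi(trigram, count):
        # log2 of the observed probability over the product of word probabilities
        p_joint = count / n
        p_indep = math.prod(word_counts[w] / n for w in trigram)
        return math.log2(p_joint / p_indep)

    scored = [(pmi(t, c), c, t) for t, c in trigram_counts.items() if keep(t, c)]
    # highest PMI first; ties broken by count, then alphabetically
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [t for _, _, t in scored[:top_n]]


# made-up toy text
tokens = ("the old man saw the great white whale , "
          "and the great white whale saw the old man").split()
print(filtered_trigrams(tokens, stop_words={"the", "and"}))
```

On this toy text only `('great', 'white', 'whale')` survives: every other trigram that occurs twice contains the stop word "the".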

## Stopwords

Stopwords are very common words (such as "the", "of" or "and") that are filtered out before or after processing natural language text.

They carry little meaning on their own, so they are generally not needed for analysis.

The exception is when we want to find collocations or idioms that contain stopwords.
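
As a minimal sketch of what stop word filtering does, the following plain-Python helper removes stop words from a token list. The helper name and the tiny hand-written stop word list are assumptions for illustration; in practice the list would come from `stopwords.words("english")`.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words (case-insensitive).

    Illustrative helper, not part of NLTK.
    """
    stop_set = {w.lower() for w in stop_words}  # a set makes lookups fast
    return [t for t in tokens if t.lower() not in stop_set]


# tiny hand-written list, standing in for stopwords.words("english")
sample_stop_words = ["the", "of", "and", "a", "in"]
tokens = ["The", "whale", "and", "the", "sea"]
print(remove_stop_words(tokens, sample_stop_words))
```

Lowercasing both sides matters because NLTK's stop word lists are lowercase while tokens at the start of a sentence are capitalized.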

In [11]:
# print("Building collocations list")
from nltk.corpus import stopwords

ignored_words = stopwords.words("english")

finder = nltk.collocations.BigramCollocationFinder.from_words(text1.tokens, window_size=2)
finder.apply_freq_filter(2)  # drop bigrams that occur fewer than 2 times
# the next filter drops bigrams containing a word shorter than 3 characters or a stop word
finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)

bigram_measures = nltk.collocations.BigramAssocMeasures()
my_collocations = finder.nbest(bigram_measures.likelihood_ratio, n=20)

In [12]:
my_collocations

[('Sperm', 'Whale'),
 ('Moby', 'Dick'),
 ('White', 'Whale'),
 ('old', 'man'),
 ('Captain', 'Ahab'),
 ('sperm', 'whale'),
 ('Right', 'Whale'),
 ('Captain', 'Peleg'),
 ('New', 'Bedford'),
 ('Cape', 'Horn'),
 ('cried', 'Ahab'),
 ('years', 'ago'),
 ('lower', 'jaw'),
 ('never', 'mind'),
 ('Father', 'Mapple'),
 ('cried', 'Stubb'),
 ('chief', 'mate'),
 ('white', 'whale'),
 ('ivory', 'leg'),
 ('one', 'hand')]
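
The `likelihood_ratio` measure used above scores how strongly a bigram's observed counts deviate from what independence of the two words would predict. Below is a rough sketch of the textbook log-likelihood ratio (G²) on a 2x2 contingency table; the function name and argument layout are assumptions, and NLTK's own implementation may differ in details such as smoothing.

```python
import math


def log_likelihood_ratio(o11, c1, c2, n):
    """G^2 statistic for a bigram (w1, w2). Illustrative, not NLTK's code.

    o11: number of times w1 is immediately followed by w2
    c1:  number of bigrams whose first word is w1
    c2:  number of bigrams whose second word is w2
    n:   total number of bigrams
    """
    # observed 2x2 table: rows = first word is / is not w1,
    # columns = second word is / is not w2
    observed = [
        [o11, c1 - o11],
        [c2 - o11, n - c1 - c2 + o11],
    ]
    row_totals = [c1, n - c1]
    col_totals = [c2, n - c2]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            e = row_totals[i] * col_totals[j] / n  # expected under independence
            if o > 0:  # a cell with zero observations contributes nothing
                g2 += o * math.log(o / e)
    return 2 * g2


# two words that always occur together vs. two words that look independent
print(log_likelihood_ratio(10, 10, 10, 100))
print(log_likelihood_ratio(1, 10, 10, 100))
```

The first call gives a large score because the pair co-occurs far more often than chance; the second gives a score of zero because the observed table matches the expected one exactly.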

In [9]:
stopwords.words("spanish")
# alas, Latvian stop words are not included here
# but a collection is available on GitHub
# we could download the Latvian stop word list from:
# https://github.com/stopwords-iso/stopwords-lv/blob/master/stopwords-lv.txt

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

## Downloading Text from the web

The recipe below can be used to download any text file from the public web.

In [10]:
# download the file from a URL
# we could use pandas, but the standard library's urllib is enough
import urllib.request  # standard library module for fetching URLs
url = "https://raw.githubusercontent.com/stopwords-iso/stopwords-lv/master/stopwords-lv.txt"
with urllib.request.urlopen(url) as response:
    stopwords_lv = response.read().decode("utf-8")
# splitlines() avoids the empty last entry that split("\n") leaves when the file ends with a newline
stopwords_lv = stopwords_lv.splitlines()
stopwords_lv[:10]

['aiz',
 'ap',
 'apakš',
 'apakšpus',
 'ar',
 'arī',
 'augšpus',
 'bet',
 'bez',
 'bija']
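
Once downloaded, such a list can be used in place of NLTK's built-in stop words. The sketch below parses one-word-per-line text into a set and filters a token list with it; the helper name `parse_stop_words`, the inline sample text, and the token list are made up for illustration, so the example runs without network access.

```python
def parse_stop_words(raw_text):
    """Turn the contents of a one-word-per-line file into a set of stop words.

    Illustrative helper, not part of NLTK.
    """
    # splitlines() handles both \n and \r\n; blank lines are skipped
    return {line.strip().lower() for line in raw_text.splitlines() if line.strip()}


# inline sample standing in for the downloaded file contents
raw = "aiz\nap\napakš\n\nar\n"
stop_words_lv = parse_stop_words(raw)

# made-up token list for illustration
tokens = ["ar", "kuģi", "ap", "pasauli"]
print([t for t in tokens if t.lower() not in stop_words_lv])
```

A set built this way can also be passed to a word filter such as the `lambda` given to `finder.apply_word_filter` in the cell above, in place of `ignored_words`.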

In [13]:
type(finder)

nltk.collocations.BigramCollocationFinder