# Deriving n-grams from Text

Based on [N-Gram Based Text Categorization: Categorizing Text With Python by Alejandro Nolla](http://blog.alejandronolla.com/2013/05/20/n-gram-based-text-categorization-categorizing-text-with-python/)

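What are n-grams? An n-gram is a contiguous sequence of n items; in this notebook the items are characters. For a longer introduction, see [this article](https://cloudmark.github.io/Language-Detection/).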
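As a quick illustration of the idea, here is a minimal sketch in plain Python. It assumes the same `_` padding convention that the NLTK call further down uses; `char_ngrams` is a hypothetical helper written for this example, not an NLTK function.

```python
def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of `word`, padded with n-1 pad symbols on each side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    # Slide a window of width n across the padded word.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

bigrams = char_ngrams("le", 2)
fourgrams = char_ngrams("le", 4)
print(bigrams)
print(fourgrams)
```

The padding lets the n-grams record where a word starts and ends, which is useful when comparing languages.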

## 1. Tokenization

In [2]:
s = "Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves."
s = s.lower()

In [4]:
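from nltk.tokenize import RegexpTokenizer
# Keep runs of letters and apostrophes. The accented characters listed here (é, è, î)
# are only the ones that occur in this sentence; other texts will need more.
tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
s_tokenized = tokenizer.tokenize(s)
s_tokenized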

['le',
 'temps',
 'est',
 'un',
 'grand',
 'maître',
 'dit',
 'on',
 'le',
 'malheur',
 'est',
 "qu'il",
 'tue',
 'ses',
 'élèves']
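The character class above lists only the accented letters found in this one sentence. As a possible alternative, the sketch below uses the standard `re` module with a Unicode-aware pattern, so any letter is accepted. The pattern and the name `tokenize_letters` are assumptions made for this example, not part of the original approach.

```python
import re

# [^\W\d_] matches any Unicode letter (a word character that is neither a digit nor "_").
# An apostrophe or backtick is allowed only between two runs of letters, as in "qu'il".
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['`][^\W\d_]+)*")

def tokenize_letters(text):
    """Lower-case `text` and return its letter-only tokens."""
    return WORD_PATTERN.findall(text.lower())

tokens = tokenize_letters(
    "Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves."
)
print(tokens)
```

On this sentence the result should match the NLTK output above; on other texts the two can differ, for example when a word begins or ends with an apostrophe.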

In [5]:
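from nltk.util import ngrams

# Build the character 4-grams of each word, padding both ends with "_"
# so that word boundaries are captured. The result holds one list per word.
generated_4grams = []
for word in s_tokenized:
    generated_4grams.append(list(ngrams(word, 4, pad_left=True, pad_right=True,
                                        left_pad_symbol="_", right_pad_symbol="_")))
generated_4grams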

[[('_', '_', '_', 'l'),
  ('_', '_', 'l', 'e'),
  ('_', 'l', 'e', '_'),
  ('l', 'e', '_', '_'),
  ('e', '_', '_', '_')],
 [('_', '_', '_', 't'),
  ('_', '_', 't', 'e'),
  ('_', 't', 'e', 'm'),
  ('t', 'e', 'm', 'p'),
  ('e', 'm', 'p', 's'),
  ('m', 'p', 's', '_'),
  ('p', 's', '_', '_'),
  ('s', '_', '_', '_')],
 [('_', '_', '_', 'e'),
  ('_', '_', 'e', 's'),
  ('_', 'e', 's', 't'),
  ('e', 's', 't', '_'),
  ('s', 't', '_', '_'),
  ('t', '_', '_', '_')],
 [('_', '_', '_', 'u'),
  ('_', '_', 'u', 'n'),
  ('_', 'u', 'n', '_'),
  ('u', 'n', '_', '_'),
  ('n', '_', '_', '_')],
 [('_', '_', '_', 'g'),
  ('_', '_', 'g', 'r'),
  ('_', 'g', 'r', 'a'),
  ('g', 'r', 'a', 'n'),
  ('r', 'a', 'n', 'd'),
  ('a', 'n', 'd', '_'),
  ('n', 'd', '_', '_'),
  ('d', '_', '_', '_')],
 [('_', '_', '_', 'm'),
  ('_', '_', 'm', 'a'),
  ('_', 'm', 'a', 'î'),
  ('m', 'a', 'î', 't'),
  ('a', 'î', 't', 'r'),
  ('î', 't', 'r', 'e'),
  ('t', 'r', 'e', '_'),
  ('r', 'e', '_', '_'),
  ('e', '_', '_', '_')],
 ...]

`generated_4grams` is a list of lists (one sub-list per word), but we want a single flat list of 4-grams, so we flatten it:

In [6]:
generated_4grams = [ngram for sublist in generated_4grams for ngram in sublist]
generated_4grams[:10]

[('_', '_', '_', 'l'),
 ('_', '_', 'l', 'e'),
 ('_', 'l', 'e', '_'),
 ('l', 'e', '_', '_'),
 ('e', '_', '_', '_'),
 ('_', '_', '_', 't'),
 ('_', '_', 't', 'e'),
 ('_', 't', 'e', 'm'),
 ('t', 'e', 'm', 'p'),
 ('e', 'm', 'p', 's')]
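The nested comprehension is one way to flatten a list of lists. A common alternative is `itertools.chain.from_iterable`, sketched here on a small hand-written input (`per_word` is illustrative data, not the notebook's variable):

```python
from itertools import chain

# One sub-list of n-gram tuples per word, mimicking the shape of generated_4grams.
per_word = [
    [("_", "l"), ("l", "e"), ("e", "_")],
    [("_", "u"), ("u", "n"), ("n", "_")],
]

# chain.from_iterable walks each sub-list in turn and yields its items.
flat = list(chain.from_iterable(per_word))
print(flat)
```

Both forms preserve the original order, so the choice is mostly a matter of taste.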

## 2. Obtaining n-grams (n=4)

In [8]:
# Join each tuple of characters into a string. Building a new list avoids
# overwriting generated_4grams, which plain assignment would only alias.
ng_list_4grams = [''.join(ngram) for ngram in generated_4grams]
ng_list_4grams[:10]

['___l',
 '__le',
 '_le_',
 'le__',
 'e___',
 '___t',
 '__te',
 '_tem',
 'temp',
 'emps']

## 3. Sorting n-grams by frequency (n=4)

In [10]:
freq_4grams = {}

# Count how many times each 4-gram occurs.
for ngram in ng_list_4grams:
    freq_4grams[ngram] = freq_4grams.get(ngram, 0) + 1

from operator import itemgetter  # itemgetter(1) picks the count out of each (ngram, count) pair

# Sort by count, highest first, and keep only the 300 most frequent n-grams,
# as suggested in the original paper on n-gram-based text categorization (Cavnar and Trenkle, 1994).
freq_4grams_sorted = sorted(freq_4grams.items(), key=itemgetter(1), reverse=True)[:300]
freq_4grams_sorted[:10]

[('e___', 4),
 ('s___', 3),
 ('t___', 3),
 ('___l', 2),
 ('__le', 2),
 ('_le_', 2),
 ('le__', 2),
 ('___t', 2),
 ('___e', 2),
 ('__es', 2)]
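The counting and ranking step can also be written with `collections.Counter` from the standard library. This is a minimal sketch on a small hand-made list (`sample_4grams` is illustrative data, not the notebook's full output):

```python
from collections import Counter

sample_4grams = ["e___", "s___", "e___", "___l", "e___", "s___"]

# Counter tallies occurrences; most_common(k) returns the k highest counts.
freq = Counter(sample_4grams)
top_two = freq.most_common(2)
print(top_two)
```

With the real data, `freq.most_common(300)` would play the role of the sorted-and-sliced list above.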

## 4. Obtaining n-grams for multiple values of n

To get n-grams for n = 1, 2, 3 and 4 we can use:

In [11]:
from nltk import everygrams

s_clean = ' '.join(s_tokenized) # The code below works on a single cleaned string rather than on the list of tokens.
s_clean

"le temps est un grand maître dit on le malheur est qu'il tue ses élèves"

In [13]:
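def ngram_extractor(sent):
    # Replace each space with "_ _" so that every word boundary is marked by "_" on both sides,
    # then drop n-grams that span the remaining space (or a newline) and the lone "_" unigram.
    return [''.join(ng) for ng in everygrams(sent.replace(" ", "_ _"), 1, 4)
            if " " not in ng and "\n" not in ng and ng != ("_", )]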

ngram_extractor(s_clean)[:10]

['l', 'le', 'le_', 'e', 'e_', '_t', '_te', '_tem', 't', 'te']
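To make the logic of the extractor easier to follow, here is a rough plain-Python sketch of a similar idea without NLTK. It is not equivalent to the function above: it is an assumption of this example that every word, including the first and last, gets one `_` on each side, and `every_char_ngrams` is a hypothetical name.

```python
def every_char_ngrams(sentence, min_n=1, max_n=4, pad="_"):
    """Return all character n-grams (min_n..max_n) of each word, with one pad per side."""
    grams = []
    for word in sentence.split():
        padded = pad + word + pad
        for start in range(len(padded)):
            for n in range(min_n, max_n + 1):
                gram = padded[start:start + n]
                # Skip windows that run past the end of the word and the lone pad symbol.
                if len(gram) == n and gram != pad:
                    grams.append(gram)
    return grams

le_grams = every_char_ngrams("le")
print(le_grams)
```

Because n-grams are built per word, none of them can span a word boundary, which is the effect the `" " not in ng` filter achieves in the NLTK version.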