### Updating and checking the NLTK version

In [None]:
!pip install -U pip
!pip install -U dill
# !pip install -U nltk==3.7

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pip
  Downloading pip-23.0.1-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 22.0.4
    Uninstalling pip-22.0.4:
      Successfully uninstalled pip-22.0.4
Successfully installed pip-23.0.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [None]:
import nltk
print(nltk.__version__)

3.7


# N-gram using NLTK

Traditionally, language models are built from n-grams: the model predicts which word comes next given a history of the preceding words.

We'll use the lm module in nltk to get a sense of how non-neural language modelling is done.
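Before turning to NLTK, the idea of predicting the next word from a history can be sketched with plain counting. This is a minimal illustration, not the way `nltk.lm` is implemented; the toy corpus and the helper name `predict_next` are invented for the example, and the history is limited to one word (a bigram model).

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this illustration (not part of the notebook's data).
corpus = [
    ["i", "am", "learning", "text", "analytics"],
    ["i", "am", "learning", "python"],
    ["i", "like", "text", "analytics"],
]

# For every word, count which words follow it.
followers = defaultdict(Counter)
for sentence in corpus:
    for prev, nxt in zip(sentence, sentence[1:]):
        followers[prev][nxt] += 1

def predict_next(history_word):
    """Return the most frequent follower of a seen word and its relative frequency."""
    counts = followers[history_word]
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

prediction = predict_next("i")
print(prediction)  # 'am' follows 'i' in 2 of 3 sentences
```

The helper assumes the history word was seen in the corpus; an unseen word has no followers, which is the gap that padding, the `<UNK>` token, and smoothing address later on.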

In [None]:
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.util import pad_sequence
from nltk.lm.preprocessing import pad_both_ends
# from nltk.lm.preprocessing import flatten

If we want to train a bigram model, we need to turn the text into bigrams. Here's what a sample sentence looks like after tokenizing it and applying the bigrams, ngrams and everygrams functions from NLTK.

In [None]:
nltk.download('punkt')
text = "I am learning Text Analytics"
tokens = nltk.tokenize.word_tokenize(text.lower())
tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['i', 'am', 'learning', 'text', 'analytics']

In [None]:
list(bigrams(tokens))

[('i', 'am'), ('am', 'learning'), ('learning', 'text'), ('text', 'analytics')]

In [None]:
list(ngrams(tokens, n=3)) # n is the order of the n-grams (3 = trigrams)

[('i', 'am', 'learning'),
 ('am', 'learning', 'text'),
 ('learning', 'text', 'analytics')]

In [None]:
list(everygrams(tokens, max_len=3)) # max_len is the highest n-gram order to generate (here 1- to 3-grams)

[('i',),
 ('i', 'am'),
 ('i', 'am', 'learning'),
 ('am',),
 ('am', 'learning'),
 ('am', 'learning', 'text'),
 ('learning',),
 ('learning', 'text'),
 ('learning', 'text', 'analytics'),
 ('text',),
 ('text', 'analytics'),
 ('analytics',)]

To let the model learn which words start and end a sentence, we add special "padding" symbols to the sentence before splitting it into n-grams. Fortunately, NLTK also has a function for that. Let's see what it does to our sentence.

In [None]:
from nltk.util import pad_sequence
list(pad_sequence(tokens, pad_left=True, left_pad_symbol="<s>", pad_right=True, right_pad_symbol="</s>", n=2))
# n is the n-gram order: n - 1 pad symbols are added per side, so bigrams pad once, trigrams pad twice, etc.

['<s>', 'i', 'am', 'learning', 'text', 'analytics', '</s>']

In [None]:
padded_sent = list(pad_sequence(tokens, pad_left=True, left_pad_symbol="<s>", pad_right=True, right_pad_symbol="</s>", n=2))
list(ngrams(padded_sent, n=2)) # bigram

[('<s>', 'i'),
 ('i', 'am'),
 ('am', 'learning'),
 ('learning', 'text'),
 ('text', 'analytics'),
 ('analytics', '</s>')]

Note the n argument: it tells the function how much padding is needed, here one symbol on each side for bigrams.
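To see how the amount of padding depends on n, the short sketch below counts the pad symbols for several orders. The wrapper `pad` is a hypothetical helper introduced only for this example; it passes the same keyword arguments to `pad_sequence` as the cells above.

```python
from nltk.util import pad_sequence

tokens = ["i", "am", "learning", "text", "analytics"]

def pad(tokens, n):
    # Hypothetical convenience wrapper around pad_sequence for this example.
    return list(pad_sequence(tokens, n=n,
                             pad_left=True, left_pad_symbol="<s>",
                             pad_right=True, right_pad_symbol="</s>"))

for n in (1, 2, 3):
    padded = pad(tokens, n)
    # Each side receives n - 1 pad symbols.
    print(n, padded.count("<s>"), padded.count("</s>"))
```

With n=1 no padding is added at all, since unigrams have no context to fill.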

Passing all these parameters every time is tedious, and in most cases the defaults can be safely assumed anyway.

For this reason the nltk.lm module provides a convenience function, pad_both_ends, that has the padding arguments already set; the remaining arguments (the sequence and n) are the same as for pad_sequence.

In [None]:
from nltk.lm.preprocessing import pad_both_ends
list(pad_both_ends(tokens, n=3))

['<s>', '<s>', 'i', 'am', 'learning', 'text', 'analytics', '</s>', '</s>']

Combining the two parts discussed so far we get the following preparation steps for one sentence.

In [None]:
list(bigrams(pad_both_ends(tokens, n=2)))

[('<s>', 'i'),
 ('i', 'am'),
 ('am', 'learning'),
 ('learning', 'text'),
 ('text', 'analytics'),
 ('analytics', '</s>')]

To make our model more robust we could also train it on unigrams (single words) as well as bigrams, its main source of information. NLTK once again helpfully provides a function called everygrams.

While not the most efficient, it is conceptually simple.

In [None]:
from nltk.util import everygrams
padded_bigrams = list(pad_both_ends(tokens, n=2))
list(everygrams(padded_bigrams, max_len=1))

[('<s>',),
 ('i',),
 ('am',),
 ('learning',),
 ('text',),
 ('analytics',),
 ('</s>',)]

In [None]:
list(everygrams(padded_bigrams, max_len=2))

[('<s>',),
 ('<s>', 'i'),
 ('i',),
 ('i', 'am'),
 ('am',),
 ('am', 'learning'),
 ('learning',),
 ('learning', 'text'),
 ('text',),
 ('text', 'analytics'),
 ('analytics',),
 ('analytics', '</s>'),
 ('</s>',)]
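What everygrams produces can be reproduced with two nested loops: for every start position, emit the n-grams of length 1 up to max_len that fit. The function below is a hand-written sketch of that idea, not NLTK's actual implementation, and the name `everygrams_sketch` is made up for illustration.

```python
def everygrams_sketch(sequence, max_len):
    """Yield all n-grams of length 1..max_len, ordered by start position."""
    sequence = list(sequence)
    for start in range(len(sequence)):
        for n in range(1, max_len + 1):
            if start + n <= len(sequence):  # skip n-grams that run past the end
                yield tuple(sequence[start:start + n])

padded = ["<s>", "i", "am", "learning", "text", "analytics", "</s>"]
result = list(everygrams_sketch(padded, max_len=2))
print(result[:4])
```

For the padded sentence this yields the same 13 tuples, in the same order, as the output shown above.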

During training and evaluation our model will rely on a vocabulary that defines which words are "known" to the model.

To create this vocabulary we need to pad our sentences (just like for counting ngrams) and then combine the sentences into one flat stream of words.
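The vocabulary step can be sketched directly with `flatten` and `Vocabulary` from nltk.lm. The single-sentence corpus below mirrors the example sentence used throughout; `unk_cutoff=1` is an assumption that keeps every word seen at least once.

```python
from nltk.lm import Vocabulary
from nltk.lm.preprocessing import flatten, pad_both_ends

sentences = [["i", "am", "learning", "text", "analytics"]]

# Pad each sentence, then merge all sentences into one flat stream of words.
padded = [list(pad_both_ends(sent, n=2)) for sent in sentences]
words = list(flatten(padded))

# Words seen fewer than unk_cutoff times would be mapped to "<UNK>".
vocab = Vocabulary(words, unk_cutoff=1)

print(len(vocab))                             # 5 words + <s> + </s> + <UNK>
print(vocab.lookup(["i", "am", "playing"]))   # 'playing' is not in the vocabulary
```

Unknown words are not dropped; `lookup` replaces them with the `<UNK>` label so that the model can still score the sequence.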

### Calculating probability of n-grams in a text / sentences

In [None]:
import nltk
nltk.download('punkt')
text = "I am learning Text Analytics"
# Tokenize the text.
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(text)))]
print(tokenized_text)

[['i', 'am', 'learning', 'text', 'analytics']]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
# Preprocess the tokenized text for 3-grams language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE # Maximum Likelihood Estimation

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n) # Let's train a trigram maximum likelihood estimation model.
model.fit(train_data, padded_sents) # counts the n-grams and builds the vocabulary
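One practical detail: both objects returned by padded_everygram_pipeline are lazy iterators, so they can be consumed only once. The sketch below illustrates this with the vocabulary stream; it is why the pipeline has to be called again before fitting a second model.

```python
from nltk.lm.preprocessing import padded_everygram_pipeline

sentences = [["i", "am", "learning", "text", "analytics"]]
train_data, padded_sents = padded_everygram_pipeline(3, sentences)

first_pass = list(padded_sents)   # consumes the iterator
second_pass = list(padded_sents)  # nothing is left to read

print(len(first_pass), len(second_pass))
```

Fitting a model on already-consumed iterators would silently train on no data, so it is safest to create a fresh pair right before each call to fit.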

In [None]:
model

<nltk.lm.models.MLE at 0x7fedfc0bf850>

To get the counts:

In [None]:
model.counts['i'] # i.e. Count('i')

1

In [None]:
model.counts[['i']]['am'] # i.e. Count('am'|'i')

1

In [None]:
model.counts[['i', 'am']]['learning'] # i.e. Count('learning'|'i am')

1

To get the probability values:

In [None]:
model.score('am', 'i'.split())  # P('am'|'i') = C(i am)/C(i) = 1/1 = 1

1.0

In [None]:
model.score('learning', 'i am'.split())  # P('learning'|'i am')

1.0

In [None]:
len(model.vocab)

8

In [None]:
model.score("i") # P(i) = C(i) / N, where N is the total number of tokens
# 5 word tokens + 4 pad symbols (2 per side for n = 3) ==> N = 9
# C(i) = 1, so P(i) = 1/9

0.1111111111111111
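The scores above can be verified by hand with simple counts. This is a sketch of the maximum likelihood arithmetic only, not NLTK's internal code; it assumes the same trigram padding (two symbols per side) that the pipeline applied.

```python
from collections import Counter

tokens = ["i", "am", "learning", "text", "analytics"]
n = 3
padded = ["<s>"] * (n - 1) + tokens + ["</s>"] * (n - 1)

unigram_counts = Counter(padded)
bigram_counts = Counter(zip(padded, padded[1:]))

# Unigram MLE: count of the word divided by the total number of tokens.
p_i = unigram_counts["i"] / len(padded)

# Bigram MLE: count of the pair divided by the count of the context word.
p_am_given_i = bigram_counts[("i", "am")] / unigram_counts["i"]

print(p_i, p_am_given_i)
```

Both values agree with model.score("i") and model.score('am', ['i']) for this sentence.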

In [None]:
model.vocab.lookup(tokenized_text)

(('i', 'am', 'learning', 'text', 'analytics'),)

In [None]:
model.vocab.lookup(["i am playing".split()])

(('i', 'am', '<UNK>'),)

In [None]:
model.counts[['i', 'am']]['playing'] # i.e. Count('playing'|'i am')

0

**Laplace Smoothing using NLTK**

In [None]:
from nltk.lm import Laplace

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model_laplace = Laplace(n) # Let's train a trigram model with Laplace (add-one) smoothing.
model_laplace.fit(train_data, padded_sents)

In [None]:
model_laplace

<nltk.lm.models.Laplace at 0x7fedfc0b31f0>

In [None]:
model_laplace.counts[['i']]['am']

1

In [None]:
model_laplace.score('am', 'i'.split())

0.2222222222222222
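The value 0.2222 follows from the add-one formula: one is added to the n-gram count and the vocabulary size is added to the context count. The function below is a minimal sketch of that arithmetic with a made-up name, `laplace_score`; the inputs (count 1, context count 1, vocabulary size 8) are taken from the outputs above.

```python
def laplace_score(ngram_count, context_count, vocab_size, gamma=1.0):
    """Add-gamma estimate; gamma = 1 corresponds to Laplace (add-one) smoothing."""
    return (ngram_count + gamma) / (context_count + gamma * vocab_size)

# P('am' | 'i'): C(i am) = 1, C(i) = 1, |V| = 8
seen = laplace_score(1, 1, 8)

# A word never observed after 'i' still receives some probability mass.
unseen = laplace_score(0, 1, 8)

print(seen, unseen)
```

Compared with the MLE model, the seen bigram drops from 1.0 to 2/9, while an unseen continuation rises from 0 to 1/9.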

## N-gram using NLTK

In [None]:
import nltk
from nltk.util import ngrams
 
# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]
 
text = 'A class is a blueprint for the object.'
 
print("1-gram: ", extract_ngrams(text, 1))
print("2-gram: ", extract_ngrams(text, 2))
print("3-gram: ", extract_ngrams(text, 3))
print("4-gram: ", extract_ngrams(text, 4))

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object', '.']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object', 'object .']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object', 'the object .']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object', 'for the object .']


## N-gram using TextBlob

In [None]:
from textblob import TextBlob
 
# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = TextBlob(data).ngrams(num)
    return [ ' '.join(grams) for grams in n_grams]
 
text = 'A class is a blueprint for the object.'
 
print("1-gram: ", extract_ngrams(text, 1))
print("2-gram: ", extract_ngrams(text, 2))
print("3-gram: ", extract_ngrams(text, 3))
print("4-gram: ", extract_ngrams(text, 4))

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object']
