# Exercise: Text Preprocessing with NLTK
Recall that an n-gram is a contiguous sequence of n tokens (here, words) taken from a corpus. N-grams are the basis of the n-gram language model, which estimates the probability that a word will appear given the words that precede it (its history).
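
To make the idea concrete, here is a minimal plain-Python sketch of n-gram extraction. The helper name `extract_ngrams` and the example sentence are assumptions made for illustration only; for real use, `nltk.util.ngrams` provides the same functionality.

```python
def extract_ngrams(tokens, n):
    """Return all contiguous n-token windows of `tokens` as tuples."""
    # zip over n shifted views of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "i love to eat pizza".split()
bigram_list = extract_ngrams(tokens, 2)
trigram_list = extract_ngrams(tokens, 3)
print(bigram_list)
print(trigram_list)
```

A sentence of five tokens gives four bigrams and three trigrams, since each window slides one token at a time.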

## 1. Load relevant libraries and download packages

In [1]:
import random

import nltk
from nltk.lm import MLE
from nltk.lm.preprocessing import flatten, pad_both_ends, padded_everygram_pipeline
from nltk.util import bigrams, everygrams, ngrams, pad_sequence

# Download the tokenizer models and the stopword list (skipped if already present).
# Recent NLTK releases may ask for 'punkt_tab' instead; download that
# if word_tokenize raises a LookupError.
nltk.download('punkt')
nltk.download('stopwords')




[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Preprocess the sentences

First tokenize the sentences, then add padding.

In [2]:
n = 3  # order of the n-gram model (3 = trigrams)

# Toy training corpus
corpus = [
    "I love to eat pizza.",
    "I love to play soccer.",
    "I love to read books.",
    "I love to create algorithms.",
    "I hate to develop in Java, but sometimes I enjoy it on a Sunday with a coffee.",
]




In natural language processing (NLP), padding generally means adding special tokens to the beginning or end of a sequence. It is often used to give all sequences in a dataset the same length. For n-gram models the purpose is different: start (`<s>`) and end (`</s>`) symbols are added to every sentence so that its first and last words also have a full context. The amount of padding depends on the order n of the model: n - 1 symbols on each side, so you pad once for 2-grams, twice for 3-grams, and so on.
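
As an illustration, here is a minimal plain-Python sketch of padding for an n-gram model. The helper `pad_sentence` is a hypothetical name used only for this example; in NLTK, `pad_both_ends(tokens, n=3)` yields the same symbols.

```python
def pad_sentence(tokens, n, start="<s>", end="</s>"):
    """Add n - 1 start symbols and n - 1 end symbols around a token list."""
    return [start] * (n - 1) + list(tokens) + [end] * (n - 1)


tokens = ["i", "love", "to", "eat", "pizza", "."]
padded_bigram = pad_sentence(tokens, 2)   # one symbol on each side
padded_trigram = pad_sentence(tokens, 3)  # two symbols on each side
print(padded_bigram)
print(padded_trigram)
```

With two start symbols in place, the first real word `i` can be predicted from the context `('<s>', '<s>')`, which is what lets a trigram model start a sentence.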


In [3]:
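# Lowercase and tokenize each sentence
tokenized_corpus = [nltk.word_tokenize(sentence.lower()) for sentence in corpus]
# Returns two lazy iterators: the everygrams (orders 1..n) of each padded sentence,
# and the flattened stream of padded words used to build the vocabulary
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_corpus)
print(padded_sents)  # prints the iterator object, not its contents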

<itertools.chain object at 0x000001FCD884EE90>


## 3. Create the n-gram model


Flattening is the process of converting a nested list or sequence into a single flat one by combining all the elements of the nested structure at a single level. Here, `padded_sents` is the flattened stream of padded tokens from all sentences, which the model uses to build its vocabulary.
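
A small sketch of flattening, using only the standard library. The nested list below is made-up example data. NLTK's `flatten` is built on `itertools.chain.from_iterable`, which is why printing `padded_sents` shows an `itertools.chain` object rather than a list.

```python
from itertools import chain

nested = [["<s>", "i", "love", "</s>"], ["<s>", "i", "hate", "</s>"]]

# chain.from_iterable returns a lazy iterator; nothing is produced until it is consumed
flat_iter = chain.from_iterable(nested)
flat = list(flat_iter)
print(flat)

# The iterator is now exhausted: consuming it a second time yields nothing
leftover = list(flat_iter)
print(leftover)
```

The second result is empty because such iterators can be consumed only once. The same applies to `train_data` and `padded_sents`, so rerun the pipeline cell before fitting a model again.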

In [4]:
model = MLE(n)  # Train a trigram model (n = 3 was set earlier)
model.fit(train_data, padded_sents)  # the second argument supplies the vocabulary
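
To see what maximum likelihood estimation amounts to for a trigram model, the sketch below computes the estimate by hand: `P(w | u, v) = count(u, v, w) / count(u, v, any word)`. It is a simplified illustration, not NLTK's implementation: it assumes whitespace tokenization, skips padding, uses only four of the sentences, and `mle_prob` is a hypothetical helper name. On the fitted model, the analogous query should be `model.score("eat", ["love", "to"])`.

```python
from collections import Counter

sentences = [
    "i love to eat pizza",
    "i love to play soccer",
    "i love to read books",
    "i love to create algorithms",
]

trigram_counts = Counter()
context_counts = Counter()
for sentence in sentences:
    tokens = sentence.split()
    for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
        trigram_counts[(u, v, w)] += 1
        context_counts[(u, v)] += 1  # how often the context (u, v) is followed by any word


def mle_prob(word, context):
    """Relative frequency of `word` after the two-word `context` (0.0 if the context is unseen)."""
    total = context_counts[tuple(context)]
    return trigram_counts[(*context, word)] / total if total else 0.0


p_eat = mle_prob("eat", ["love", "to"])      # one of four continuations of "love to"
p_to = mle_prob("to", ["i", "love"])         # "i love" is always followed by "to"
p_unseen = mle_prob("swim", ["love", "to"])  # never observed, so MLE assigns zero
print(p_eat, p_to, p_unseen)
```

The zero for an unseen continuation is the main limitation of plain MLE: anything absent from the training data is treated as impossible.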

In [5]:
from nltk.tokenize.treebank import TreebankWordDetokenizer

detokenize = TreebankWordDetokenizer().detokenize

def generate_sent(model, num_words, random_seed=42):
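    """
    Generate a sentence from the model, dropping the padding symbols.

    :param model: An ngram language model from `nltk.lm.models`.
    :param num_words: Max no. of words to generate.
    :param random_seed: Seed value for random.
    :return: The generated tokens, detokenized into a single string.
    """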
    content = []
    for token in model.generate(num_words, random_seed=random_seed):
        if token == '<s>':
            continue
        if token == '</s>':
            break
        content.append(token)
    return detokenize(content)

## 4. Generate text and test your result

In [6]:
generate_sent(model, num_words=20, random_seed=42)

'i enjoy it on a sunday with a coffee.'