# Sentence Generation with n-grams
## Learning Objective
In this assignment, you will build a bigram model from the Brown corpus, apply Laplace smoothing to the bigram counts, and evaluate the perplexity of the model.

In the last portion of the assignment, you will generate sentences from the bigram model.

<b><div style="text-align: right">[TOTAL POINTS: 10]</div></b>

## Assignment Overview

In this assignment, you will complete the following tasks.

* Preprocessing Dataset (lowercasing, removing punctuation, adding start and end tokens)
* Adding unknown tokens
* Creating n-grams and their corresponding counts
* Creating Laplace Smoothing n-gram model
* Calculating Perplexity of the model
* Bonus (Sentence Generation with the model)

## Dataset Description:

**Brown Corpus**

The Brown Corpus is an electronic collection of text samples of American English drawn from a variety of genres. It was compiled by W. N. Francis and H. Kucera at Brown University.

The corpus contains one million words of American English sampled from 15 different text categories. This corpus consists of 500 texts from different genres, each consisting of over 2000 words. The different text categories are as follows:

    1. Report (44 texts)
    2. Editorial (27 texts)
    3. Reviews (17 texts)
    4. Religion (17 texts)
    5. Skill and Hobbies (36 texts)
    6. Popular Lore (48 texts)
    7. Belles-Lettres (75 texts)
    8. Government (30 texts)
    9. Learned (80 texts)
    10. Fiction: General (29 texts)
    11. Fiction: Mystery (24 texts)
    12. Fiction: Science (6 texts)
    13. Fiction: Adventure (29 texts)
    14. Fiction: Romance (29 texts)
    15. Humor (9 texts)

*Source:* https://www1.essex.ac.uk/linguistics/external/clmt/w3c/corpus_ling/content/corpora/list/private/brown/brown.html \
*Author: W.N. Francis and H. Kucera, Brown University, Providence, RI*

The Brown corpus is available in [NLTK corpora](http://www.nltk.org/nltk_data/).

In [None]:
!pip install -q nltk

In [None]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
from nltk.corpus import brown
categories = brown.categories()
categories

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

For the training set, you will use sentences from the categories `'adventure', 'editorial', 'fiction', 'hobbies', 'humor', 'lore', 'mystery', 'reviews'`.

For the test set, you will use sentences from the category `'romance'`.

In [None]:
train_lines = brown.sents(categories=['adventure', 'editorial', 'fiction',
                                      'hobbies', 'humor', 'lore', 'mystery', 'reviews',
                                     ])
test_lines = brown.sents(categories=['romance'])

print(f"Training data: \n{train_lines[:10]}")
print("\n\n")
print(f"Test data: \n{test_lines[:10]}")

Training data: 
[['Assembly', 'session', 'brought', 'much', 'good'], ['The', 'General', 'Assembly', ',', 'which', 'adjourns', 'today', ',', 'has', 'performed', 'in', 'an', 'atmosphere', 'of', 'crisis', 'and', 'struggle', 'from', 'the', 'day', 'it', 'convened', '.'], ['It', 'was', 'faced', 'immediately', 'with', 'a', 'showdown', 'on', 'the', 'schools', ',', 'an', 'issue', 'which', 'was', 'met', 'squarely', 'in', 'conjunction', 'with', 'the', 'governor', 'with', 'a', 'decision', 'not', 'to', 'risk', 'abandoning', 'public', 'education', '.'], ['There', 'followed', 'the', 'historic', 'appropriations', 'and', 'budget', 'fight', ',', 'in', 'which', 'the', 'General', 'Assembly', 'decided', 'to', 'tackle', 'executive', 'powers', '.'], ['The', 'final', 'decision', 'went', 'to', 'the', 'executive', 'but', 'a', 'way', 'has', 'been', 'opened', 'for', 'strengthening', 'budgeting', 'procedures', 'and', 'to', 'provide', 'legislators', 'information', 'they', 'need', '.'], ['Long-range', 'planning', 'o

In the following section, you are provided with `train_sentences` and `test_sentences` as lists of lowercased sentences.

In [None]:
train_sentences = [" ".join(sent).lower() for sent in train_lines]
test_sentences = [" ".join(sent).lower() for sent in test_lines]

print(f"Training sentences: \n{train_sentences[:10]}")
print("\n\n")
print(f"Test sentences: \n{test_sentences[:10]}")

Training sentences: 
['assembly session brought much good', 'the general assembly , which adjourns today , has performed in an atmosphere of crisis and struggle from the day it convened .', 'it was faced immediately with a showdown on the schools , an issue which was met squarely in conjunction with the governor with a decision not to risk abandoning public education .', 'there followed the historic appropriations and budget fight , in which the general assembly decided to tackle executive powers .', 'the final decision went to the executive but a way has been opened for strengthening budgeting procedures and to provide legislators information they need .', 'long-range planning of programs and ways to finance them have become musts if the state in the next few years is to avoid crisis-to-crisis government .', 'this session , for instance , may have insured a financial crisis two years from now .', 'in all the turmoil , some good legislation was passed .', 'some other good bills were lo

### Exercise 1 : Remove punctuation and add start and end tokens.
<b><div style="text-align: right">[POINTS: 2]</div></b>

`train_sentences` and `test_sentences` contain lists of lowercased sentences. However, they still contain punctuation marks such as `(".", "?", ",", "'", "-")`. Your task is to remove the punctuation and then add the start `'<s>'` and end `'</s>'` tokens to each sentence.

**Tasks:**

* Remove punctuation
* Add start and end tokens to each sentence
* Return the list of sentences with punctuation removed and start and end tokens added.
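
The punctuation-removal step can be sketched with `str.translate` on a toy sentence. This is only a minimal illustration; the helper name `strip_punctuation` and the example sentence are hypothetical and not part of the assignment code.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(sentence):
    """Delete punctuation, then collapse the whitespace left behind."""
    return " ".join(sentence.translate(PUNCT_TABLE).split())

example = "long-range planning , they said , isn't easy ."
print(strip_punctuation(example))
```

Note that hyphens and apostrophes are also in `string.punctuation`, so hyphenated words and contractions are joined into single tokens rather than split.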

In [None]:
import string

SOS = "<s> "
EOS = "</s>"

def basic_preprocess(sentences):
    """
    Remove punctuation from the (already lowercased) sentences
    and add start and end tokens.

    For example:
    Args:
        sentences(list) : ['this is first sentence .', 'this is second sentence .']
    Returns:
        sents(list): ['<s> this is first sentence </s>', '<s> this is second sentence </s>']
    """

    sents = None
    ### Ex-1-Task-1
    ### BEGIN SOLUTION
    # YOUR CODE HERE
    # Delete every ASCII punctuation character, collapse the leftover
    # whitespace, then wrap each sentence in the start and end tokens.
    # SOS already ends with a space, so only EOS needs a separator.
    table = str.maketrans('', '', string.punctuation)
    sents = [SOS + " ".join(sent.translate(table).split()) + " " + EOS for sent in sentences]
    # raise NotImplementedError()
    ### END SOLUTION

    return sents

In [None]:
# Intentionally Left Blank

In [None]:
processed_train_sentences = basic_preprocess(train_sentences)
processed_test_sentences = basic_preprocess(test_sentences)

### Exercise 2 : Replace words with count = 1 with '<UNK>' token and create word tokens
<b><div style="text-align: right">[POINTS: 2]</div></b>

`processed_train_sentences` and `processed_test_sentences` contain the processed sentences as lists. Your task is to find every word token that appears only once in the corpus and replace it with the `'<UNK>'` token. The function should return all tokens of the corpus, in order, as a single flat list.

For example, if the sample input is:
```
['<s> this is first sentence </s>', '<s> this is second sentence </s>']
```
Here the words `'first'` and `'second'` appear only once in the corpus.
The output should be:
```
['<s>', 'this', 'is', '<UNK>', 'sentence', '</s>', '<s>', 'this', 'is', '<UNK>', 'sentence', '</s>']
```


**Task:**

*  Replace every word whose count is 1 in the corpus with the `<UNK>` token and create individual word tokens.

In [None]:
UNK = "<UNK>"

def generate_tokens(sentences):
    """
    Takes a list of sentences with start and end tokens.
    The function should replace the words which occur only once in the corpus with
    '<UNK>' token and return the list of all tokens.
    For example:
    Args:
        sentences(list):
        ['<s> this is first sentence </s>', '<s> this is second sentence </s>']

    Returns:
        tokens_with_unk(list):
        ['<s>', 'this', 'is', '<UNK>', 'sentence', '</s>', '<s>', 'this', 'is', '<UNK>', 'sentence', '</s>']

    """
    tokens = " ".join(sentences).split()
    vocab = nltk.FreqDist(tokens)

    tokens_with_unk = None
    ### Ex-2-Task-1
    ### BEGIN SOLUTION
    # YOUR CODE HERE
    # `vocab` maps each token to its corpus frequency; tokens seen once become <UNK>.
    tokens_with_unk = [UNK if vocab[word] == 1 else word for word in tokens]
    ### END SOLUTION

    return tokens_with_unk

In [None]:
# Intentionally Left Blank


In [None]:
train_tokens = generate_tokens(processed_train_sentences)
test_tokens = generate_tokens(processed_test_sentences)

### Exercise 3 : Create n-grams and get their counts
<b><div style="text-align: right">[POINTS: 2]</div></b>

Now, it's time to create n-grams from the tokens generated from Exercise 2. Your task is to return unique n-grams with their corresponding counts.

**Task:**
* Create n-grams and return unique n-grams with their corresponding counts.

Hint: `nltk.ngrams()` and `nltk.FreqDist()` functions may be helpful.
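
To make the idea of n-gram counting concrete, here is a minimal sketch that does not rely on NLTK: it slides a window of width `n` over the tokens and counts each resulting tuple. The name `count_ngrams` is hypothetical and is used only for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n=2):
    """Slide a window of width n over the tokens and count each tuple."""
    # zip over n shifted copies of the list yields consecutive n-tuples.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(windows)

toy_tokens = ['<s>', 'this', 'is', '<UNK>', 'sentence', '</s>',
              '<s>', 'this', 'is', '<UNK>', 'sentence', '</s>']

toy_unigrams = count_ngrams(toy_tokens, 1)
toy_bigrams = count_ngrams(toy_tokens, 2)

print(toy_unigrams[('this',)])
print(toy_bigrams[('this', 'is')])
print(toy_bigrams[('</s>', '<s>')])
```

Because the corpus is one flat token list, the bigram `('</s>', '<s>')` appears wherever one sentence ends and the next begins.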

In [None]:
def ngrams(tokens, n=2):
    """
    Create n-grams and return unique n-grams with their corresponding counts.

    Args:
        tokens (list): list of tokens
        n (int): 1 for unigram, 2 for bigram

    Returns:
        n_grams (dict): dictionary mapping each n-gram tuple to its count.

    Example:
        tokens = ['<s>', 'this', 'is', '<UNK>', 'sentence', '</s>',
                '<s>', 'this', 'is', '<UNK>', 'sentence', '</s>']
        For n = 1,

        n_grams:{
                ('<s>',): 2,
                ('this',): 2,
                ('is',): 2,
                ('<UNK>',): 2,
                ('sentence',): 2,
                ('</s>',): 2
                }

        For n = 2,

        n_grams: {
                ('<s>', 'this') : 2,
                ('this', 'is') : 2,
                ('is', '<UNK>') : 2,
                ('<UNK>', 'sentence') : 2,
                ('</s>', '<s>') : 1,
                ('sentence', '</s>') : 2
                }
    """
    ngram_dicts = None
    ### Ex-3-Task-1
    ### BEGIN SOLUTION
    # YOUR CODE HERE
    # Use the fully qualified nltk names so that no import shadows
    # this function's own name.
    n_grams = nltk.ngrams(tokens, n)
    ngram_dicts = dict(nltk.FreqDist(n_grams))
    # raise NotImplementedError()
    ### END SOLUTION

    return ngram_dicts

In [None]:
# Intentionally Left Blank


In [None]:
n = 2

bigram_dicts = ngrams(train_tokens, n)
unigram_dicts = ngrams(train_tokens, n-1)

In [None]:
vocab = nltk.FreqDist(train_tokens)
vocab_size = len(vocab)
vocab_size

15086

### Exercise 4 : Laplace Smoothing
<b><div style="text-align: right">[POINTS: 3]</div></b>

We know that the Laplace smoothing for bigram is given as:

$$
P_{Laplace}^{*}(w_n|w_{n-1}) = \frac{\text{count}(w_{n-1}w_n) + 1}{\text{count}(w_{n-1}) + V}
$$

Here, $w_{n-1}$ is the previous word and $w_n$ is the present word of the bigram. Also, $V$ is the vocab size of the corpus.

For example:
$$P_{Laplace}^{*}(\text{"great"} \mid \text{"the"}) = \frac{\text{count}(\text{"the"}, \text{"great"}) + 1}{\text{count}(\text{"the"}) + V}$$
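
The formula can be checked on a few made-up counts. The sketch below is illustrative only: the helper `laplace_prob`, the five-word vocabulary, and all counts are assumptions, not values from the Brown corpus. It also verifies that the smoothed probabilities of every possible next word sum to 1 when the counts following the previous word add up to its unigram count.

```python
import math

def laplace_prob(bigram_count, prev_count, vocab_size):
    """Add-one smoothed estimate of P(w_n | w_{n-1})."""
    return (bigram_count + 1) / (prev_count + vocab_size)

# Hypothetical counts of each word following "the"; they sum to count("the") = 10.
following_the = {"great": 3, "cat": 5, "dog": 2, "the": 0, "</s>": 0}
prev_count = sum(following_the.values())
V = len(following_the)

seen = laplace_prob(following_the["great"], prev_count, V)    # (3 + 1) / (10 + 5)
unseen = laplace_prob(following_the["</s>"], prev_count, V)   # (0 + 1) / (10 + 5)
total = sum(laplace_prob(c, prev_count, V) for c in following_the.values())

print(seen, unseen, total)
```

An unseen bigram receives a small nonzero probability, which is what keeps the logarithm in the perplexity calculation well defined.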

**Task:**
* Apply Laplace smoothing to a bigram

In [None]:
def smoothed_bigram_prob(bigram, bigram_count, unigram_dicts, vocab_size):

    """
    Args:
        bigram (a tuple): a tuple of bigrams for ex: ('the', 'great')
        bigram_count(int): count of bigram
        unigram_dicts: dictionary containing unigrams and their corresponding counts
        vocab_size: vocab size of the corpus

    Returns:
        smoothed_prob(float): Smoothed probability of the bigram.
    """

    unigram = None
    unigram_count = None
    smoothed_prob = None
    ### Ex-4-Task-1
    ### BEGIN SOLUTION
    # YOUR CODE HERE
    # The conditioning context is the first word of the bigram,
    # stored in unigram_dicts under a 1-tuple key.
    unigram = (bigram[0],)
    unigram_count = unigram_dicts[unigram]
    smoothed_prob = (bigram_count + 1) / (unigram_count + vocab_size)
    # raise NotImplementedError()
    ### END SOLUTION
    return smoothed_prob

In [None]:
# Intentionally Left Blank


In [None]:
def smoothing(bigram_dicts):
    """
    Args:
        bigram_dicts (dict): dictionary items containing bigram tuple and their corresponding count.

    Returns:
        (dict) : dictionary items containing bigram tuple and their smoothed probability.
    """
    return { n_gram: smoothed_bigram_prob(n_gram, count, unigram_dicts, vocab_size) \
            for n_gram, count in bigram_dicts.items() }

In [None]:
model = smoothing(bigram_dicts)
sorted(model.items(), key=lambda x: x[1], reverse=True)[:20]

[(('</s>', '<s>'), 1.4589685801405277),
 (('of', 'the'), 0.2021742012461885),
 (('<s>', 'the'), 0.16200450749038844),
 (('in', 'the'), 0.13568871801670424),
 (('the', '<UNK>'), 0.09909850192231208),
 (('<UNK>', '</s>'), 0.09770648283176454),
 (('<s>', 'he'), 0.0906138141323081),
 (('to', 'the'), 0.08086968049847541),
 (('on', 'the'), 0.06370144504838923),
 (('<UNK>', 'and'), 0.0556144769985417),
 (('<UNK>', '<UNK>'), 0.05362587829775951),
 (('<s>', 'it'), 0.05216757258385258),
 (('and', 'the'), 0.050178973883070396),
 (('a', '<UNK>'), 0.04971496751955455),
 (('<s>', 'i'), 0.04746122232533475),
 (('and', '<UNK>'), 0.04480975739095851),
 (('at', 'the'), 0.042158292456582265),
 (('for', 'the'), 0.041429139599628795),
 (('<UNK>', 'of'), 0.039506827522206016),
 (('<s>', 'but'), 0.0382473816783773)]

### Exercise 5 : Calculate perplexity.
<b><div style="text-align: right">[POINTS: 1]</div></b>

We know the perplexity of the test set for bigram is given as:
$$
PP(S) = \sqrt[N]{\frac{1}{\prod \limits_{i=1}^{N} P(w_i | w_{i-1})}}    \tag{1}
$$

i.e.
$$
PP(S) = ({\prod \limits_{i=1}^{N} P(w_i | w_{i-1})})^{\frac{-1}{N}} \tag{2}
$$

Take log on both sides:
$$
\log(PP(S)) = {-\frac{1}{N}} \sum \limits_{i=1}^{N}{\log(P(w_i | w_{i-1}))} \tag{3}
$$


So $PP(S)$ is the exponential of the negative sum of log probabilities, normalized by the number of tokens $N$ in the test set, \
i.e.
$$
PP(S) = \exp(-{\frac{1}{N}} \sum \limits_{i=1}^{N}{\log(P(w_i | w_{i-1}))})  \tag{4}
$$
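
As a quick sanity check, equations (2) and (4) can be compared numerically on a toy list of probabilities. This is a sketch under stated assumptions: the names `perplexity_from_logs` and `perplexity_from_product` are hypothetical, and $N$ is taken to be the number of probabilities in the list.

```python
import math

def perplexity_from_logs(probs):
    """Equation (4): exponential of the negative mean log probability."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

def perplexity_from_product(probs):
    """Equation (2): product of the probabilities raised to the power -1/N."""
    n = len(probs)
    return math.prod(probs) ** (-1 / n)

# If every token has probability 1/4, the model is as uncertain as a
# uniform choice among 4 words, so the perplexity should be about 4.
uniform_probs = [0.25] * 8
mixed_probs = [0.5, 0.1, 0.2, 0.05]

print(perplexity_from_logs(uniform_probs))
print(perplexity_from_logs(mixed_probs), perplexity_from_product(mixed_probs))
```

Summing logs is preferred on a real test set because multiplying many small probabilities underflows to zero. Since every probability is at most 1, a correctly computed perplexity is never below 1.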



**Task:**
* You are given the probabilities of the test set in a list. Calculate the perplexity of the test set, using equation (4).

In [None]:
masks = [[1,1], [1, 0], [0, 1], [0, 0]]

def convert_oov(ngram):
    """Converts, if necessary, a given n-gram to one which is known by the model.
    Args:
        ngram (tuple): a bigram tuple. for ex: ("the", "great")
    Returns:
        The n-gram with <UNK> tokens in certain positions such that the model
        contains an entry for it.

    """
    mask = lambda ngram, bitmask: tuple((token if flag == 1 else "<UNK>" for token,flag in zip(ngram, bitmask)))

    ngram = (ngram,) if type(ngram) is str else ngram
    for possible_known in [mask(ngram, bitmask) for bitmask in masks]:
        if possible_known in model:
            return possible_known

In [None]:
test_ngrams = nltk.ngrams(test_tokens, 2)
N = len(test_tokens)
known_ngrams  = (convert_oov(ngram) for ngram in test_ngrams)
probs = [model[ngram] for ngram in known_ngrams]

In [None]:
import math
def perplexity(prob, N):
    """
    Args:
        prob (list): list of bigram probabilities for the test set.
        N(int): Number of tokens in the test set.

    Returns:
        perplexity(float): Perplexity of the model in the test set.
    """
    perplexity = None
    ### Ex-5-Task-1
    ### BEGIN SOLUTION
    # YOUR CODE HERE
    perplexity = math.exp((-1/N) * sum([math.log(p) for p in prob]) )
    # raise NotImplementedError()
    ### END SOLUTION
    return perplexity

In [None]:
# Intentionally Left Blank


In [None]:
pps = perplexity(probs, N)
print(f"Perplexity of the model is: {pps}")

Perplexity of the model is: 0.000997781885014523


## Sentence Generation with n-grams
Now that our bigram model is ready, let's generate some sample sentences from the model.
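
One simple way to generate text from a bigram model is to sample each next token in proportion to its probability. The sketch below shows that idea on a tiny hand-written model; it is not the selection strategy used in this notebook's code, and `toy_model`, `sample_next`, and `sample_sentence` are hypothetical names with made-up probabilities.

```python
import random

# Hypothetical toy model: bigram tuple -> probability (values are made up).
toy_model = {
    ("<s>", "the"): 0.6,
    ("<s>", "a"): 0.4,
    ("the", "cat"): 0.7,
    ("the", "dog"): 0.3,
    ("a", "dog"): 1.0,
    ("cat", "</s>"): 1.0,
    ("dog", "</s>"): 1.0,
}

def sample_next(prev, model, rng):
    """Sample the token after `prev`, weighted by bigram probability."""
    candidates = [(bg[1], p) for bg, p in model.items() if bg[0] == prev]
    if not candidates:
        return "</s>"  # no known continuation: end the sentence
    tokens, weights = zip(*candidates)
    return rng.choices(tokens, weights=weights, k=1)[0]

def sample_sentence(model, rng, max_len=10):
    """Extend the sentence one token at a time until </s> or max_len."""
    sent = ["<s>"]
    while sent[-1] != "</s>" and len(sent) < max_len:
        sent.append(sample_next(sent[-1], model, rng))
    if sent[-1] != "</s>":
        sent.append("</s>")
    return " ".join(sent)

rng = random.Random(0)  # fixed seed so repeated runs give the same sentence
print(sample_sentence(toy_model, rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.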

In [None]:

import math
import random
def best_candidate(prev, i, without=[]):
    """Choose the most likely next token given the previous (n-1) tokens.
    Args:
        prev (tuple of str): the previous n-1 tokens of the sentence.
        i (int): which candidate to select if not the most probable one.
        without (list of str): tokens to exclude from the candidates list.
    Returns:
        A tuple with the next most probable token and its corresponding probability.

    """

    blacklist  = ["<UNK>"] + without
    candidates = ((ngram[-1], prob) for ngram, prob in model.items() if ngram[:-1]==prev)
    candidates = filter(lambda candidate: candidate[0] not in blacklist, candidates)
    candidates = sorted(candidates, key=lambda candidate: candidate[1], reverse=True)

    if len(candidates) == 0:
        return ("</s>", 1)
    else:
        candidate_index = int((random.randint(0, len(candidates)))/2)
        return candidates[candidate_index if prev != () and prev[-1] != "<s>" else i]

def generate_sentences(num, min_len=12, max_len=24):
    """Generate random sentences using the language model.
    Args:
        num (int): the number of sentences to generate.
        min_len (int): minimum allowed sentence length.
        max_len (int): maximum allowed sentence length.
    Yields:
        A tuple with the generated sentence and a score of -1 / log(p),
        where p is the product of its bigram probabilities (higher means more probable).

    """
    for i in range(num):
        sent, prob = ["<s>"], 1
        while sent[-1] != "</s>":
            prev = tuple(sent[-(1):])
            blacklist = sent + (["</s>"] if len(sent) < min_len else [])
            next_token, next_prob = best_candidate(prev, i, without=blacklist)
            sent.append(next_token)
            prob *= next_prob

            if len(sent) >= max_len:
                sent.append("</s>")

        yield ' '.join(sent), -1/math.log(prob)

In [None]:
print("Generating sentences...")
for sentence, prob in generate_sentences(num = 10):
    print("{} ({:.5f})".format(sentence, prob))

Generating sentences...
<s> the transmutation of including new game whitetail hunter this kind mr america was quiet and bend handle a wrinkled heavy weight in crisp </s> (0.00522)
<s> he declared his records because were given number on behind you did </s> (0.00955)
<s> it stands in jeopardy </s> (0.03661)
<s> i cant expect you stress and postponed only on toward full communism established that feeding lowmoisture corn in individuals to germany it doesnt </s> (0.00526)
<s> but pansy seeds so then installed over in jerusalem why arent the competition at cypress swamp </s> (0.00703)
<s> in only to note however it floating ice cream six times of 1927 </s> (0.00891)
<s> a waterfront shouting there evidence the boss was tried her breasts suffocating one morning classes riding down early stages of computing compression ratio </s> (0.00514)
<s> and bow down the towns perched on weekends enable us all teachers are realizing that boy asserted would reduce their days she took </s> (0.00528)
<s> 

In this assignment, we built a bigram model with Laplace smoothing and calculated its perplexity on the test set.

Then, we generated some sample sentences from the bigram model.

---

**Congratulations on successfully completing the assignment**.

Good Luck going forward with the course.
See you in the next chapter.