# NLP Assignment 2

## N-Gram Model

In this assignment you will use the N-Gram notebook provided in class.

• Implement Laplace smoothing for the N-Gram model in the notebook so that it can handle words with zero counts.

• After implementing the smoothing, show its impact by displaying some examples.
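
For reference, a common way to write add-α (Laplace) smoothing for a trigram model is given below. The notation is assumed here rather than taken from the class notebook: C(·) is a count in the corpus, |V| is the vocabulary size, and α = 1 corresponds to classic add-one smoothing.

```latex
P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3) + \alpha}{C(w_1, w_2) + \alpha \, |V|}
```

With α = 0 this reduces to the unsmoothed estimate, which is zero for any unseen trigram and undefined when the context (w_1, w_2) has never been observed.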

In [None]:
import nltk
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from collections import Counter

nltk.download('brown')
nltk.download('punkt')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# Loading the corpus
corpus = brown.words()

# Case folding and getting vocab
lower_case_corpus = [w.lower() for w in corpus]
vocab = set(lower_case_corpus)

print('Total words in Corpus:', len(lower_case_corpus))
print('Vocab of the Corpus:', len(vocab))

Total words in Corpus: 1161192
Vocab of the Corpus: 49815


In [None]:
# Counting bigrams and trigrams using Counter
bigram_counts = Counter(zip(lower_case_corpus, lower_case_corpus[1:]))
trigram_counts = Counter(zip(lower_case_corpus, lower_case_corpus[1:], lower_case_corpus[2:]))

print("Example, count for bigram ('the', 'king') is: " + str(bigram_counts[('the', 'king')]))

Example, count for bigram ('the', 'king') is: 51


In [None]:
# Takes a sentence as input and suggests the three most probable next words
# under a Laplace-smoothed trigram model
def suggest_next_word(input_, bigram_counts, trigram_counts, vocab):
    # Splitting the input into tokens
    tokens = input_.lower().split()
    if len(tokens) < 2:
        raise ValueError('input_ must contain at least two words')

    # Consider the last bigram of the sentence as the context
    last_bigram = (tokens[-2], tokens[-1])

    # Laplace (add-one) smoothing parameter
    alpha = 1

    # The context count and the denominator are the same for every candidate word
    last_bigram_count = bigram_counts.get(last_bigram, 0)
    denominator = last_bigram_count + alpha * len(vocab)

    # Calculating probability for each word in vocab with Laplace smoothing
    vocab_probabilities = {}
    for vocab_word in vocab:
        test_trigram = (last_bigram[0], last_bigram[1], vocab_word)
        test_trigram_count = trigram_counts.get(test_trigram, 0)

        # Laplace smoothing: (C(w1, w2, w) + alpha) / (C(w1, w2) + alpha * |V|)
        vocab_probabilities[vocab_word] = (test_trigram_count + alpha) / denominator

    # Sorting the vocab probabilities in descending order to get the top probable words
    top_suggestions = sorted(vocab_probabilities.items(), key=lambda x: x[1], reverse=True)[:3]
    return top_suggestions
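
To make the effect of the smoothing term easier to see, the sketch below compares unsmoothed and add-one estimates on a tiny made-up corpus. Everything in it (`toy_tokens`, `toy_vocab`, `trigram_probability`) is hypothetical and for illustration only; it is not part of the class notebook and does not use the Brown corpus.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration (not the Brown corpus)
toy_tokens = 'the king of france and the king of england and the queen'.split()
toy_vocab = set(toy_tokens)

toy_bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
toy_trigram_counts = Counter(zip(toy_tokens, toy_tokens[1:], toy_tokens[2:]))


def trigram_probability(w1, w2, w3, alpha):
    """P(w3 | w1, w2) with add-alpha smoothing.

    alpha = 0 gives the unsmoothed estimate, which fails with a
    ZeroDivisionError if the context (w1, w2) was never seen.
    """
    numerator = toy_trigram_counts[(w1, w2, w3)] + alpha
    denominator = toy_bigram_counts[(w1, w2)] + alpha * len(toy_vocab)
    return numerator / denominator


context = ('king', 'of')
for word in ['france', 'england', 'queen']:
    unsmoothed = trigram_probability(*context, word, alpha=0)
    smoothed = trigram_probability(*context, word, alpha=1)
    print(f'{word}: unsmoothed={unsmoothed:.3f}, smoothed={smoothed:.3f}')

# For this context the smoothed probabilities over the vocabulary sum to 1
total = sum(trigram_probability(*context, w, alpha=1) for w in toy_vocab)
print(f'sum over vocab: {total:.3f}')
```

In this toy corpus, 'queen' never follows 'king of', so its unsmoothed probability is 0, while add-one smoothing assigns it 1/9 and lowers 'france' and 'england' from 1/2 to 2/9.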

In [None]:
# Example sentences to demonstrate the impact of Laplace smoothing
examples = [
    'I am the king',
    'I am the king of',
    'I am the king of france',
    'I am the king of france and',
    'This is a completely new sentence without any previous context'
]

for example in examples:
    suggestions = suggest_next_word(example, bigram_counts, trigram_counts, vocab)
    print(f"\nExample: '{example}'")
    print("Suggestions:")
    for word, probability in suggestions:
        print(f"  {word}: {probability:.5f}")


Example: 'I am the king'
Suggestions:
  james: 0.00020
  of: 0.00018
  arthur: 0.00014

Example: 'I am the king of'
Suggestions:
  france: 0.00010
  hearts: 0.00006
  england: 0.00004

Example: 'I am the king of france'
Suggestions:
  and: 0.00010
  .: 0.00010
  ,: 0.00008

Example: 'I am the king of france and'
Suggestions:
  the: 0.00008
  germany: 0.00006
  louisiana: 0.00004

Example: 'This is a completely new sentence without any previous context'
Suggestions:
  constables: 0.00002
  dunn's: 0.00002
  shawano: 0.00002
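
The last example is where smoothing matters most. Its three suggestions all have the same probability, which is consistent with the context bigram ('previous', 'context') never occurring in the corpus: every trigram count is then 0 and every word receives 1/|V|. The short check below reproduces that value from the vocabulary size printed earlier; it assumes the context count is 0, which the output suggests but the notebook does not print.

```python
# Vocabulary size copied from the notebook output ('Vocab of the Corpus')
vocab_size = 49815
# Laplace smoothing parameter used in suggest_next_word
alpha = 1

# Assumed unseen context: C(w1, w2) = 0 and every trigram count is 0,
# so each word receives alpha / (alpha * |V|) = 1 / |V|
unseen_context_probability = (0 + alpha) / (0 + alpha * vocab_size)
print(f'{unseen_context_probability:.5f}')
```

Because every word is tied in this case, the three words shown are arbitrary: `sorted` is stable, so they are simply the first three in the set's iteration order. Without smoothing, the same context would give a division by zero instead of a usable distribution.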
