<a href="https://colab.research.google.com/github/UdayG01/Ngrams-LM/blob/main/NgramLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Ngram LM

<h5>Aim:</h5>

* Here, I implement n-gram counts (bigrams and trigrams) and explore how they help in generating text.

<h5>Conclusions: </h5>

* I am still unsure what exactly "corpus reader classes" are, and how to measure the effect of using n-grams of a higher order than bigrams or trigrams.
* Some calls to `suggest_next_word()` returned acceptable predictions, but a few raised a ZeroDivisionError. This happens when the last two words of the input phrase never occur as a bigram in the corpus, so the bigram count used as the denominator is zero.
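
One way to start exploring the higher-order question is to count n-grams for an arbitrary order `n` and watch how quickly repeated n-grams disappear. The sketch below is only an illustration: `count_ngrams` and `toy_tokens` are hypothetical names, and the token list is made up rather than taken from the Brown corpus.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    # Zipping n staggered views of the list yields each n-gram as a tuple
    return Counter(zip(*(tokens[i:] for i in range(n))))

# Toy token list, made up for illustration (not from the Brown corpus)
toy_tokens = "the king of france and the king of spain".split()

for n in (1, 2, 3, 4):
    counts = count_ngrams(toy_tokens, n)
    # An n-gram seen more than once is the only kind that gives
    # the model more than a single observation to estimate from
    repeated = sum(1 for c in counts.values() if c > 1)
    print(f"n={n}: {len(counts)} distinct n-grams, {repeated} seen more than once")
```

On this toy input the number of n-grams seen more than once falls from 3 at `n=1` to 0 at `n=4`, which hints at why higher orders need much more data and why unseen contexts become more common.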


In [None]:
from nltk.corpus import brown
from nltk.tokenize import word_tokenize

In [None]:
import nltk
nltk.download('brown')
nltk.download('punkt')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Loading the corpus
corpus = brown.words()

# Case folding and getting vocab
lower_case_corpus = [w.lower() for w in corpus]
vocab = set(lower_case_corpus)

print('CORPUS EXAMPLE: ' + str(lower_case_corpus[:30]) + '\n\n')
print('VOCAB EXAMPLE: ' + str(list(vocab)[:10]))

CORPUS EXAMPLE: ['the', 'fulton', 'county', 'grand', 'jury', 'said', 'friday', 'an', 'investigation', 'of', "atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.', 'the', 'jury', 'further', 'said', 'in']


VOCAB EXAMPLE: ['penna.', 'expects', 'defying', 'sighting', 'word', 'competitors', 'escutcheons', 'seasonally', 'jail', 'bmews']


In [None]:
print('Total words in Corpus: ' + str(len(lower_case_corpus)))
print('Vocab of the Corpus: ' + str(len(vocab)))

Total words in Corpus: 1161192
Vocab of the Corpus: 49815


In [None]:
bigram_counts = {}
trigram_counts = {}

# Sliding through corpus to get bigram and trigram counts
for i in range(len(lower_case_corpus) - 2):
    # Getting bigram and trigram at each slide
    bigram = (lower_case_corpus[i], lower_case_corpus[i+1])
    trigram = (lower_case_corpus[i], lower_case_corpus[i+1], lower_case_corpus[i+2])

    # Keeping track of the bigram counts (a missing key starts at 0)
    bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1

    # Keeping track of trigram counts
    trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1

print("Example, count for bigram ('the', 'king') is: " + str(bigram_counts[('the', 'king')]))

Example, count for bigram ('the', 'king') is: 51


In [None]:
# Takes a sentence as input and suggests the words most likely to come after it
def suggest_next_word(input_, bigram_counts, trigram_counts, vocab):
    # Consider the last bigram of the sentence
    tokenized_input = word_tokenize(input_.lower())
    last_bigram = tokenized_input[-2:]

    # The conditioning bigram and its count are the same for every candidate word
    test_bigram = (last_bigram[0], last_bigram[1])
    test_bigram_count = bigram_counts.get(test_bigram, 0)

    # Calculating probability for each word in vocab
    vocab_probabilities = {}
    for vocab_word in vocab:
        test_trigram = (last_bigram[0], last_bigram[1], vocab_word)
        test_trigram_count = trigram_counts.get(test_trigram, 0)

        # Raises ZeroDivisionError when the bigram was never seen in the corpus
        probability = test_trigram_count / test_bigram_count
        vocab_probabilities[vocab_word] = probability

    # Sorting the vocab probability in descending order to get top probable words
    top_suggestions = sorted(vocab_probabilities.items(), key=lambda x: x[1], reverse=True)[:3]
    return top_suggestions
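
The probability computed inside the loop is the usual maximum-likelihood estimate for a trigram model, where $C(\cdot)$ denotes a count in the corpus (the symbols here are notation introduced for this note, not names from the code):

$$P(w_3 \mid w_1, w_2) = \frac{C(w_1, w_2, w_3)}{C(w_1, w_2)}$$

The estimate is undefined when $C(w_1, w_2) = 0$, that is, when the last two words of the input never appear together in the corpus.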

In [None]:
suggest_next_word('I am', bigram_counts, trigram_counts, vocab)

[('not', 0.11594202898550725),
 ('a', 0.06280193236714976),
 ('sure', 0.04830917874396135)]

In [None]:
suggest_next_word('the investigation produced', bigram_counts, trigram_counts, vocab)

ZeroDivisionError: division by zero

In [None]:
suggest_next_word("atlanta's recent election", bigram_counts, trigram_counts, vocab)

ZeroDivisionError: division by zero
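
Both failures above come from a context bigram that the corpus never contains. A minimal sketch of one possible guard is shown below. It is not the notebook's function: `suggest_next_word_safe`, `build_counts`, and the toy corpus are hypothetical, and plain whitespace splitting stands in for `word_tokenize` so the example runs without NLTK.

```python
from collections import Counter

def build_counts(tokens):
    """Bigram and trigram counts from the same sliding window as the notebook."""
    bigram_counts, trigram_counts = Counter(), Counter()
    for i in range(len(tokens) - 2):
        bigram_counts[(tokens[i], tokens[i + 1])] += 1
        trigram_counts[(tokens[i], tokens[i + 1], tokens[i + 2])] += 1
    return bigram_counts, trigram_counts

def suggest_next_word_safe(input_, bigram_counts, trigram_counts, vocab, top_k=3):
    # Assumption: whitespace splitting replaces word_tokenize in this sketch
    tokens = input_.lower().split()
    if len(tokens) < 2:
        return []  # not enough context for a trigram model
    context = (tokens[-2], tokens[-1])
    context_count = bigram_counts.get(context, 0)
    if context_count == 0:
        return []  # unseen context: the unguarded division would fail here
    probabilities = {
        word: trigram_counts.get(context + (word,), 0) / context_count
        for word in vocab
    }
    ranked = sorted(probabilities.items(), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Toy corpus, made up for illustration
toy_tokens = "the king of france and the king of spain".split()
toy_bigrams, toy_trigrams = build_counts(toy_tokens)
toy_vocab = sorted(set(toy_tokens))

print(suggest_next_word_safe("the king", toy_bigrams, toy_trigrams, toy_vocab))
print(suggest_next_word_safe("spain and", toy_bigrams, toy_trigrams, toy_vocab))
```

Returning an empty list is only one choice. Smoothing, or backing off to bigram or unigram counts, would still give a suggestion for unseen contexts.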

In [None]:
suggest_next_word('I am the king', bigram_counts, trigram_counts, vocab)

[('james', 0.17647058823529413),
 ('of', 0.1568627450980392),
 ('arthur', 0.11764705882352941)]

In [None]:
suggest_next_word('I am the king of', bigram_counts, trigram_counts, vocab)

[('france', 0.3333333333333333),
 ('hearts', 0.16666666666666666),
 ('spain', 0.08333333333333333)]

In [None]:
suggest_next_word('I am the king of france', bigram_counts, trigram_counts, vocab)

[('.', 0.26666666666666666), ('and', 0.26666666666666666), (',', 0.2)]

In [None]:
suggest_next_word('I am the king of france and', bigram_counts, trigram_counts, vocab)

[('the', 0.2),
 ('germany', 0.13333333333333333),
 ('other', 0.06666666666666667)]
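
The last few cells extend the prompt by hand, one suggested word at a time. The same idea can be written as a loop. The sketch below is an assumption-laden illustration, not part of the original notebook: `build_followers` and `generate` are hypothetical helpers, the corpus is a made-up toy sentence, and whitespace splitting again stands in for `word_tokenize`.

```python
from collections import Counter, defaultdict

def build_followers(tokens):
    """Map each two-word context to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
        followers[(w1, w2)][w3] += 1
    return followers

def generate(seed, followers, max_new_words=5):
    """Greedily extend seed with the most frequent follower of its last two words."""
    words = seed.lower().split()  # assumption: whitespace split, not word_tokenize
    for _ in range(max_new_words):
        if len(words) < 2:
            break  # a trigram model needs two words of context
        candidates = followers.get((words[-2], words[-1]))
        if not candidates:
            break  # unseen context: stop instead of dividing by zero
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

# Toy corpus, made up for illustration
toy_tokens = "i am the king of france and i am the king of spain .".split()
followers = build_followers(toy_tokens)

print(generate("i am", followers, max_new_words=3))
```

Indexing followers by context avoids scanning the whole vocabulary for every prediction. Always taking the single most frequent word can make the output repeat itself, so sampling in proportion to the counts is a common alternative.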