# **Aim:**

 Write a program to generate n-grams (bigrams, trigrams, etc.) of English and Hindi text.

# **Theory:**

**What Are N-Grams?**

N-grams are sequences of consecutive words taken from a text, named after the number of words they contain. As an outline:

*   Unigrams: one word
*   Bigrams: two words
*   Trigrams: three words
*   And so forth


To explore n-grams further, we can break down the sentence below:

Hi there everyone, we’re exploring n-grams today.

Unigram: hi | there | everyone | etc… <br>
Bigram: hi there | exploring n-grams | etc…  <br>
Trigram: hi there everyone | exploring n-grams today | etc…  <br>
Note that the words must be consecutive in the text to form an n-gram.
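
The sliding-window idea described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `make_ngrams` is hypothetical, and the sentence is split on whitespace after the comma has been dropped by hand.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Example sentence from the text, lowercased and without the comma
sentence = "hi there everyone we're exploring n-grams today"
tokens = sentence.split()

for n in (1, 2, 3):
    print(n, make_ngrams(tokens, n))
```

A sentence with 7 tokens yields 7 unigrams, 6 bigrams and 5 trigrams; asking for an n larger than the number of tokens simply returns an empty list.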

*    Imagine listening to someone as they speak and trying to guess the next word that they are going to say. For example, what word is likely to follow each of these sentence fragments?
I’d like to make a . . .     / Please hand over your . . .
*   Guessing the next word (or word prediction) is an essential subtask of speech recognition, handwriting recognition, augmentative communication for the disabled, and spelling error detection.
*   In such tasks, word identification is difficult because the input is very noisy and ambiguous.
*   Thus, looking at the previous words can give us an important cue about what the next ones are going to be.
*  N-gram models predict the next word from the previous N − 1 words.
*   Such statistical models of word sequences are also called language models or LMs.
*   Computing the probability of the next word turns out to be closely related to computing the probability of a sequence of words.
*   The following sequence, for example, has a non-zero probability of appearing in a text: 
. . . all of a sudden I notice three guys standing on the sidewalk... 
*   while this same set of words in a different order has a much lower probability: 
on guys all I of notice sidewalk three a sudden standing the
*  N-gram probabilities can also help to correct spelling errors.
*  For instance, “drink cofee” could be corrected to “drink coffee” if you knew that the word “coffee” has a high probability of occurring after the word “drink” and that the letter overlap between “cofee” and “coffee” is high.
*  Let’s start with the expression P(w|h), the probability of a word w given some history h. For example, P(the|its water is so transparent that)
Here,<br>
w = the <br>
h = its water is so transparent that
*  One way to estimate this probability is the relative frequency count approach: take a substantially large corpus, count the number of times you see *its water is so transparent that*, and then count the number of times it is followed by *the*. 
*  In other words, you are answering the question:
Out of the times you saw the history h, how many times did the word w follow it?<br>
P(the|its water is so transparent that) = C(its water is so transparent that the)/C(its water is so transparent that)
*   As you can imagine, this is not feasible in practice: even a corpus of significant size contains too few occurrences of most long histories to give reliable counts.
*   This shortcoming, together with the chain-rule decomposition of the probability function, is the base intuition of the N-gram model. Here, instead of conditioning on the entire history, you approximate it by just the last few words.
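
The relative frequency estimate described in the list above can be tried out on a tiny corpus. The following is a minimal sketch: the three-sentence corpus is made up for illustration, and the helper names `count_sequence` and `relative_frequency` are hypothetical.

```python
def count_sequence(tokens, seq):
    """Count how many times the token list `seq` occurs contiguously in `tokens`."""
    n = len(seq)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == seq)


def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history word) / C(history)."""
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        return 0.0  # history never seen: the counts give no estimate
    return count_sequence(tokens, history + [word]) / history_count


# Made-up corpus in which the history occurs three times
corpus = (
    "its water is so transparent that the fish are visible . "
    "its water is so transparent that you can see the bottom . "
    "its water is so transparent that the sand shows ."
).split()
history = "its water is so transparent that".split()

print(relative_frequency(corpus, history, "the"))
```

Here the history is followed by *the* in two of its three occurrences, so the estimate is 2/3. A history that never occurs in the corpus returns 0.0, which is exactly the weakness of counting long histories directly.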






In [None]:
import re 
from nltk.util import ngrams
s = "Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software. The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.In this post, you will discover what natural language processing is and why it is so important."
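# Lowercase so that "Natural" and "natural" count as the same token
s = s.lower()
# Replace every character that is not a letter, digit or whitespace with a space
s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)
# Split on whitespace; split() with no argument already drops empty strings
tokens = s.split()
# Slide a window of three tokens over the text
output = list(ngrams(tokens, 3))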

In [None]:
print(output)

[('natural', 'language', 'processing'), ('language', 'processing', 'or'), ('processing', 'or', 'nlp'), ('or', 'nlp', 'for'), ('nlp', 'for', 'short'), ('for', 'short', 'is'), ('short', 'is', 'broadly'), ('is', 'broadly', 'defined'), ('broadly', 'defined', 'as'), ('defined', 'as', 'the'), ('as', 'the', 'automatic'), ('the', 'automatic', 'manipulation'), ('automatic', 'manipulation', 'of'), ('manipulation', 'of', 'natural'), ('of', 'natural', 'language'), ('natural', 'language', 'like'), ('language', 'like', 'speech'), ('like', 'speech', 'and'), ('speech', 'and', 'text'), ('and', 'text', 'by'), ('text', 'by', 'software'), ('by', 'software', 'the'), ('software', 'the', 'study'), ('the', 'study', 'of'), ('study', 'of', 'natural'), ('of', 'natural', 'language'), ('natural', 'language', 'processing'), ('language', 'processing', 'has'), ('processing', 'has', 'been'), ('has', 'been', 'around'), ('been', 'around', 'for'), ('around', 'for', 'more'), ('for', 'more', 'than'), ('more', 'than', '50')

In [None]:
output = list(ngrams(tokens,2))
print(output)

[('natural', 'language'), ('language', 'processing'), ('processing', 'or'), ('or', 'nlp'), ('nlp', 'for'), ('for', 'short'), ('short', 'is'), ('is', 'broadly'), ('broadly', 'defined'), ('defined', 'as'), ('as', 'the'), ('the', 'automatic'), ('automatic', 'manipulation'), ('manipulation', 'of'), ('of', 'natural'), ('natural', 'language'), ('language', 'like'), ('like', 'speech'), ('speech', 'and'), ('and', 'text'), ('text', 'by'), ('by', 'software'), ('software', 'the'), ('the', 'study'), ('study', 'of'), ('of', 'natural'), ('natural', 'language'), ('language', 'processing'), ('processing', 'has'), ('has', 'been'), ('been', 'around'), ('around', 'for'), ('for', 'more'), ('more', 'than'), ('than', '50'), ('50', 'years'), ('years', 'and'), ('and', 'grew'), ('grew', 'out'), ('out', 'of'), ('of', 'the'), ('the', 'field'), ('field', 'of'), ('of', 'linguistics'), ('linguistics', 'with'), ('with', 'the'), ('the', 'rise'), ('rise', 'of'), ('of', 'computers'), ('computers', 'in'), ('in', 'this')

In [None]:
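# Install NLTK before importing it
!pip install nltk
import nltk
# 'all' fetches every NLTK dataset; this notebook only relies on the tokenizer models and the Reuters corpus
nltk.download('all')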


True

In [None]:
import nltk
from nltk.util import ngrams
 
# Function to generate n-grams from sentences.
def extract_ngrams(data, num):
    n_grams = ngrams(nltk.word_tokenize(data), num)
    return [ ' '.join(grams) for grams in n_grams]
 
data = 'A class is a blueprint for the object.'
 
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram:  ['A', 'class', 'is', 'a', 'blueprint', 'for', 'the', 'object', '.']
2-gram:  ['A class', 'class is', 'is a', 'a blueprint', 'blueprint for', 'for the', 'the object', 'object .']
3-gram:  ['A class is', 'class is a', 'is a blueprint', 'a blueprint for', 'blueprint for the', 'for the object', 'the object .']
4-gram:  ['A class is a', 'class is a blueprint', 'is a blueprint for', 'a blueprint for the', 'blueprint for the object', 'for the object .']


In [None]:
data = 'एक वर्ग वस्तु के लिए एक खाका है.'
 
print("1-gram: ", extract_ngrams(data, 1))
print("2-gram: ", extract_ngrams(data, 2))
print("3-gram: ", extract_ngrams(data, 3))
print("4-gram: ", extract_ngrams(data, 4))

1-gram:  ['एक', 'वर्ग', 'वस्तु', 'के', 'लिए', 'एक', 'खाका', 'है', '.']
2-gram:  ['एक वर्ग', 'वर्ग वस्तु', 'वस्तु के', 'के लिए', 'लिए एक', 'एक खाका', 'खाका है', 'है .']
3-gram:  ['एक वर्ग वस्तु', 'वर्ग वस्तु के', 'वस्तु के लिए', 'के लिए एक', 'लिए एक खाका', 'एक खाका है', 'खाका है .']
4-gram:  ['एक वर्ग वस्तु के', 'वर्ग वस्तु के लिए', 'वस्तु के लिए एक', 'के लिए एक खाका', 'लिए एक खाका है', 'एक खाका है .']
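
Hindi text usually ends a sentence with the danda (।) rather than a full stop, and a cleanup pattern such as `[^a-zA-Z0-9\s]` used earlier would delete every Devanagari character. The sketch below is one possible way to strip punctuation from both languages without NLTK; it assumes words are separated by whitespace, and the names `simple_tokens` and `make_ngrams` are hypothetical helpers, not part of any library.

```python
import string

# ASCII punctuation plus the Devanagari danda (U+0964) and double danda (U+0965)
PUNCTUATION = string.punctuation + "\u0964\u0965"
STRIP_TABLE = str.maketrans({ch: " " for ch in PUNCTUATION})


def simple_tokens(text):
    """Replace punctuation with spaces, then split on whitespace."""
    return text.translate(STRIP_TABLE).split()


def make_ngrams(tokens, n):
    """Join every run of n consecutive tokens into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


hindi = "एक वर्ग वस्तु के लिए एक खाका है।"
english = "A class is a blueprint for the object."

for text in (english, hindi):
    tokens = simple_tokens(text)
    print("2-gram:", make_ngrams(tokens, 2))
    print("3-gram:", make_ngrams(tokens, 3))
```

Unlike the `word_tokenize` output above, the final punctuation mark is dropped instead of becoming its own token, so both sentences produce 8 tokens.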


**What is a Language Model in NLP?**

A language model learns to predict the probability of a sequence of words. But why do we need to learn the probability of words? Let’s understand that with an example.

One use of a language model is in machine translation: you take a sequence of words in one language and convert it into another language. A system may produce many candidate translations, and you will want to compute the probability of each of them to decide which one is most likely to be correct.

When the language model assigns a higher probability to one candidate sentence than to another, we prefer the first. That’s how we arrive at the right translation.

This ability to model the rules of a language as probabilities gives great power for NLP-related tasks. Language models are used in speech recognition, machine translation, part-of-speech tagging, parsing, optical character recognition, handwriting recognition, information retrieval, and many other everyday tasks.

There are primarily two types of Language Models:

*  Statistical Language Models: These models use traditional statistical techniques like N-grams, Hidden Markov Models (HMM) and certain linguistic rules to learn the probability distribution of words.
*  Neural Language Models: These are the new players in NLP town and have surpassed the statistical language models in their effectiveness. They use different kinds of neural networks to model language.

**How do N-gram Language Models work?**

<img src="https://drive.google.com/thumbnail?id=1zskf8VrpdsJXD1aiyGr67RFzsZqtiUgO" height="100">

An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. With a good N-gram model, we can predict p(w | h), the probability of seeing the word w given a history h of previous words, where the history contains n-1 words.

We must estimate this probability to construct an N-gram model.

We compute this probability in two steps:

1.   Apply the chain rule of probability.
2.   Apply a very strong simplifying assumption that lets us compute p(w1 ... wn) easily.


The chain rule of probability is:

p(w1 ... wn) = p(w1) . p(w2 | w1) . p(w3 | w1 w2) . p(w4 | w1 w2 w3) ..... p(wn | w1 ... wn-1)

So what is the chain rule? It tells us how to compute the joint probability of a sequence by using the conditional probability of a word given previous words.

But we cannot estimate these conditional probabilities reliably when the condition spans up to n-1 words, because such long histories rarely repeat in a corpus. So how do we proceed?

This is where we introduce a simplification assumption. We can assume for all conditions, that:

p(wk | w1...wk-1) = p(wk | wk-1)

Here, we approximate the history (the context) of the word wk by looking only at the last word of the context. This assumption is called the Markov assumption. (We used it here with a context of length 1, which corresponds to a bigram model; in general, we could use larger fixed-size histories.)
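
To see the chain rule and the Markov assumption working together, here is a minimal bigram sketch on a made-up three-sentence corpus. The `<s>` and `</s>` boundary markers and the function names `bigram_prob` and `sentence_prob` are assumptions made for this illustration.

```python
from collections import Counter

# Made-up training sentences
corpus = [
    "i like green tea",
    "i like black coffee",
    "you like green tea",
]

context_counts = Counter()  # C(prev)
bigram_counts = Counter()   # C(prev word)
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    context_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(prev, word):
    """p(word | prev) = C(prev word) / C(prev), or 0.0 for an unseen prev."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def sentence_prob(sentence):
    """Multiply p(wk | wk-1) along the sentence (chain rule + Markov assumption)."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("i like green tea"))
print(sentence_prob("tea green like i"))
```

The first sentence gets 2/3 · 1 · 2/3 · 1 · 1 = 4/9, while the scrambled version gets 0 because the model never saw a sentence starting with "tea".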

 
**Building a Basic Language Model**

Now that we understand what an N-gram is, let’s build a basic language model using trigrams of the Reuters corpus. The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million words. We can build a language model in a few lines of code using the NLTK package:

In [None]:
from nltk.corpus import reuters
from nltk import trigrams
from collections import defaultdict

# Nested mapping: model[(w1, w2)][w3] holds the count (later the probability) of w3 after (w1, w2)
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count how often each word follows each pair of preceding words
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

In the code above, we first split our text into trigrams with the help of NLTK and then count how often each trigram occurs in the dataset.

We then use it to calculate probabilities of a word, given the previous two words. That’s essentially what gives us our Language Model!

Let’s make simple predictions with this language model. We will start with two simple words – “today the”. We want our model to tell us what will be the next word:

In [None]:
dict(model["today","the"])

This gives us every word that can come next, together with its probability. Now, if we pick the word “price” and make another prediction for the pair “the” and “price”:

In [None]:
dict(model["the","price"])

{}
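
A lookup like this is easier to read when the candidates are sorted by probability. The sketch below rebuilds the same kind of trigram table on three made-up sentences, so it runs without the Reuters download; the `None` padding mirrors `pad_left=True, pad_right=True`, and the helper name `top_k` is hypothetical.

```python
from collections import Counter, defaultdict

# Made-up tokenized sentences standing in for reuters.sents()
sentences = [
    ["the", "price", "of", "gold", "rose"],
    ["the", "price", "of", "oil", "fell"],
    ["the", "price", "rose"],
]

# counts[(w1, w2)][w3] = number of times w3 follows the pair (w1, w2)
counts = defaultdict(Counter)
for sent in sentences:
    padded = [None, None] + sent + [None, None]
    for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
        counts[(w1, w2)][w3] += 1


def top_k(w1, w2, k=3):
    """Return up to k (word, probability) pairs that follow (w1, w2), most likely first."""
    following = counts.get((w1, w2), Counter())
    total = sum(following.values())
    return [(word, count / total) for word, count in following.most_common(k)]


print(top_k("the", "price"))
print(top_k("no", "context"))
```

For the pair ("the", "price") the sketch ranks "of" (2/3) ahead of "rose" (1/3); a pair that never occurred returns an empty list, just like an empty dictionary from the model.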

If we keep following this process iteratively, we will soon have a coherent sentence! Here is a script to play around with generating a random piece of text using our n-gram model:

In [None]:
import random

# Starting words
text = ["today", "the"]
sentence_finished = False

while not sentence_finished:
    # Draw a random threshold in [0, 1)
    r = random.random()
    accumulator = 0.0
    candidates = model[tuple(text[-2:])]

    for word, prob in candidates.items():
        accumulator += prob
        # Pick the first word whose cumulative probability reaches the threshold
        if accumulator >= r:
            text.append(word)
            break
    else:
        # No word was picked (for example, the context was never seen): stop
        # instead of looping forever
        sentence_finished = True

    # Two consecutive None tokens are the right padding, i.e. the end of a sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))

**Limitations of the N-gram approach to Language Modeling**

N-gram based language models do have a few drawbacks:

*   The higher the N, the better the model usually is. But this leads to a lot of computational overhead and requires a large amount of RAM.
*   N-grams are a sparse representation of language. This is because we build the model based on the probability of words co-occurring. It will give zero probability to all the words that are not present in the training corpus.
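
The sparsity problem is easy to observe even for bigrams. The following sketch uses a made-up 13-word corpus to compare the number of bigrams that actually occur with the number that could occur, and shows that an unseen bigram gets a probability of exactly zero. All names here are chosen for illustration only.

```python
from collections import Counter

# Made-up corpus
tokens = "the cat sat on the mat and the dog sat on the rug".split()
vocab = sorted(set(tokens))

bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])


def bigram_prob(prev, word):
    """p(word | prev) from raw counts, with no smoothing."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


possible = len(vocab) ** 2
observed = len(bigram_counts)
print(f"{observed} of {possible} possible bigrams were observed")
print(bigram_prob("the", "cat"))  # seen once after four occurrences of "the"
print(bigram_prob("the", "sat"))  # never seen, so the estimate is zero
```

With 8 distinct words there are 64 possible bigrams, of which only 10 occur here. The table of possible n-grams grows as the vocabulary size raised to the power N, which is why larger N needs both more memory and far more training text.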

# **Conclusion:**

Hence, we implemented a program to generate n-grams (bigrams, trigrams, etc.) of English and Hindi text.