## Natural Language Processing (NLP) - Assignment 1
## Instructor:
* **Fantahun B. (Ph.D.)**
### Student:
* **Name:** Wesagn Dawit
* **ID:** GSR/5257/15
* **Program:** MSc in Artificial Intelligence
#### Introduction:
* This individual assignment is based on one version of the General Purpose Amharic Corpus (GPAC). The assignment involved building **n-gram language models for n = 1, 2, 3, 4** and evaluating their performance **intrinsically and extrinsically**. For intrinsic evaluation, **perplexity** was calculated from the n-gram probabilities estimated on the corpus. Since the corpus lacked labels, **text generation** served as the extrinsic evaluation method: random sentences were generated for n = 1 to 4 and judged on their coherence and ability to make sense. In addition, the likelihoods of test sentences were calculated under each model; a higher probability indicates that the model was more likely to generate that sentence. In general, the higher the order of the n-gram model, the more coherent the generated sentence is, but the more data is required to train the model. The order should therefore be chosen based on the available data, and we think that is why our instructor specifically asked us to create n-gram models for n = 1, 2, 3, 4.

## Initializations

In [1]:
# import the necessary libraries
import gc
import re
import time
import math
import random
import collections

## Load the data

In [2]:
# read GPAC.txt and keep its first line as a single string
# (readlines()[0] discards any later lines, so the corpus is assumed to sit on one line)
start = time.time()
with open('GPAC.txt', 'r', encoding='utf-8') as f:
    data = f.readlines()[0]

In [3]:
# print sample 500 characters
print("Sample data:\n", data[:500])

Sample data:
     ምን መሰላችሁ? (አንባቢያን) ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው ያልቻለችው የአለም የእግር ኳስ ዋ ለ19ኛ ጊዜ በደቡብ አፍሪካ ሲጠጣ፣ በሩቅ እያየች አንጀቷ ባረረ ልክ በአመቱ በለስ ቀናትና ሌላ ዋ ልትታደም ሁለት ልጆቿን ወደ ደቡብ አፍሪካ ላከች፡፡6ኛው ቢግ ብራዘርስ አፍሪካ አብሮ የመኖር ውድድር በደቡብ አፍሪካ ተካሂዷል፡፡ ከተለያዩ 14 የአፍሪካ አገራት የተውጣጡ 26 ያህል ተሳታፊዎች የተካፈሉበት ይህ ውድድር፣ ግለሰቦች በፈታኝ ሁኔታ ውስጥ በማለፍ ብቃታቸውን የሚያስመሰክሩበት መሆኑን ሰምተናል፡፡ የሚገጥሟቸውን የተለያዩ ፈተናዎች በትእግስትና በጥበብ ማለፍ፣ ከሌሎች ጋር ተስማምቶ መዝለቅ፣ ችግሮችን በብልጠት መፍታት ወዘተ     በየጊዜው ከሚደረገው ቅነሳ ተርፈው ለ91 ቀናት ያህል በውድድሩ መቆየት የቻሉ ሁለት ተወዳዳሪዎች እያንዳንዳቸው 200 ሺህ ዶላር እንደሚ


## Preprocessing
#### 1. Tokenize the corpus

In [4]:
# define a list of punctuation marks
special_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']

amharic_chars = ['፡፡', "::", '፡', '።', '፣', '፤', '፥', '፦', '፧', '፨']

geez = ['፩', '፪', '፫', '፬', '፭', '፮', '፯', '፰', '፱', '፲', '፳', '፴', '፵', '፶', '፷', '፸', '፹', '፺', '፻']

puncs = list(set(special_chars + amharic_chars + geez))

In [5]:
# define a function that merges consecutive ':' ':' or '፡' '፡' token pairs into the full stop '።'
def processed_list(split_list: list):
    tokens = []
    i = 0
    while i < len(split_list):
        if (split_list[i] == ':' and i + 1 < len(split_list) and split_list[i + 1] == ':') or \
                (split_list[i] == '፡' and i + 1 < len(split_list) and split_list[i + 1] == '፡'):
            tokens.append('።')
            i += 2  # Skip the next character as it is part of a consecutive pair
        else:
            tokens.append(split_list[i])
            i += 1

    return tokens

In [6]:
# define a function to tokenize the text data
def amh_tokenizer(text: str):
    # Escape punctuations to ensure they are not interpreted as regex operators
    escaped_puncs = [re.escape(p) for p in puncs]

    # Create the regex pattern: words or any of the specified punctuations
    pattern = r'\w+|' + '|'.join(escaped_puncs)

    # Use re.findall to get all matches
    words = re.findall(pattern, text)

    # Remove empty strings and items with space only from the list
    split_list = [word for word in words if word != '' and word != ' ']

    return processed_list(split_list)
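
Because `puncs` is built from a `set`, the order of the alternatives in the regex is not fixed, so a two-character mark such as `፡፡` may or may not be tried before `፡`. The sketch below shows one way to make the order deterministic by sorting the marks longest first. The shortened `PUNCTUATION` list and the names `build_token_pattern` and `tokenize` are illustrative assumptions, not part of the notebook.

```python
import re

# Hypothetical, shortened punctuation list used only for this illustration.
PUNCTUATION = ['፡፡', '::', '።', '፡', '፣', '?', '!', '(', ')', ':']

def build_token_pattern(punctuation):
    # Longest marks first, so '፡፡' is tried before '፡' in the alternation.
    ordered = sorted(punctuation, key=len, reverse=True)
    return re.compile(r'\w+|' + '|'.join(re.escape(p) for p in ordered))

TOKEN_PATTERN = build_token_pattern(PUNCTUATION)

def tokenize(text):
    # Map both two-character full stops to the single sign '።'.
    return ['።' if token in ('፡፡', '::') else token
            for token in TOKEN_PATTERN.findall(text)]

print(tokenize("ምን መሰላችሁ? ላከች፡፡"))
```

With this ordering the pair-merging pass is no longer needed for the two-character marks, since they are matched as single tokens.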


In [7]:
# tokenize the original data
tokenized_data = amh_tokenizer(data)

del data
gc.collect()

395

In [8]:
print("Number of tokens: ", len(tokenized_data))

Number of tokens:  86949986


In [9]:
# print the first 10 tokens
tokenized_data[:10]

['ምን', 'መሰላችሁ', '?', '(', 'አንባቢያን', ')', 'ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው', 'ደርሷት']

#### 2. Remove unnecessary characters from the tokenized data
* **Purpose:** Simplify corpus, reduce noise, enhance generalization
* **Approach:** Exclude punctuation marks to focus on essential content
* **Considerations:** Task-specific requirements may warrant preserving certain punctuation **(e.g., the sentence-end markers '።', '፡፡' and '?' are kept here)**

In [10]:
# remove punctuation marks from the tokenized data, keeping the sentence-end markers '።', '፡፡' and '?'
def remove_puncs(tokens: list):
    processed_tokens = []
    for token in tokens:
        if token not in puncs or token == '።' or token == '፡፡' or token == '?':
            processed_tokens.append(token)

    return processed_tokens

processed_tokenized_data = remove_puncs(tokenized_data)

In [11]:
del tokenized_data
gc.collect()

16

In [12]:
print("Number of tokens after removing unnecessary characters: ", len(processed_tokenized_data))

Number of tokens after removing unnecessary characters:  82693302


#### 3. Normalize the tokens

In [13]:
# normalize the tokens by replacing variant characters or sequences that share a sound with one canonical form
def normalize(norm: str):
    replacements = {"ሃ": "ሀ", "ኅ": "ሀ", "ኃ": "ሀ", "ሐ": "ሀ", "ሓ": "ሀ", "ኻ": "ሀ", "ሑ": "ሁ",
                    "ኁ": "ሁ", "ዅ": "ሁ", "ኂ": "ሂ", "ሒ": "ሂ", "ኺ": "ሂ", "ኌ": "ሄ", "ሔ": "ሄ",
                    "ዄ": "ሄ", "ሕ": "ህ", "ኆ": "ሆ", "ሖ": "ሆ", "ኾ": "ሆ", "ሠ": "ሰ", "ሡ": "ሱ",
                    "ሢ": "ሲ", "ሣ": "ሳ", "ሤ": "ሴ", "ሥ": "ስ", "ሦ": "ሶ", "ዓ": "አ", "ኣ": "አ",
                    "ዐ": "አ", "ዑ": "ኡ", "ዒ": "ኢ", "ዔ": "ኤ", "ዕ": "እ", "ዖ": "ኦ", "ፀ": "ጸ",
                    "ፁ": "ጹ", "ጺ": "ፂ", "ጻ": "ፃ", "ጼ": "ፄ", "ፅ": "ጽ", "ፆ": "ጾ"}

    for character, replacement in replacements.items():
        norm = norm.replace(character, replacement)

    specific_patterns = [
        '(ሉ[ዋአሃ])', '(ሙ[ዋአሃ])', '(ቱ[ዋአሃ])', '(ሩ[ዋአሃ])', '(ሱ[ዋአሃ])', '(ሹ[ዋአሃ])', '(ቁ[ዋአሃ])',
        '(ቡ[ዋአሃ])', '(ቹ[ዋአሃ])', '(ሁ[ዋአሃ])', '(ኑ[ዋአሃ])', '(ኙ[ዋአሃ])', '(ኩ[ዋአሃ])', '(ዙ[ዋአሃ])',
        '(ጉ[ዋአሃ])', '(ዱ[ዋአሃ])', '(ጡ[ዋአሃ])', '(ጩ[ዋአሃ])', '(ጹ[ዋአሃ])', '(ፉ[ዋአሃ])', '[ቊ]', '[ኵ]',
        r'\s+'
    ]
    replacements_specific = [
        'ሏ', 'ሟ', 'ቷ', 'ሯ', 'ሷ', 'ሿ', 'ቋ',
        'ቧ', 'ቿ', 'ኋ', 'ኗ', 'ኟ', 'ኳ', 'ዟ',
        'ጓ', 'ዷ', 'ጧ', 'ጯ', 'ጿ', 'ፏ', 'ቁ', 'ኩ',
        ' '
    ]

    for pattern, replacement in zip(specific_patterns, replacements_specific):
        norm = re.sub(pattern, replacement, norm)


    return norm
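
The single-character replacements above run one `str.replace` pass per mapping for every token. As a hedged alternative, the sketch below applies all single-character mappings in one pass with `str.translate`. `CHAR_MAP` holds only a few of the mappings for illustration, and `normalize_chars` is a hypothetical helper name; the multi-character regex patterns would still need a separate step.

```python
# A few of the single-character mappings from `normalize`, for illustration only.
CHAR_MAP = {"ሃ": "ሀ", "ሐ": "ሀ", "ሠ": "ሰ", "ዓ": "አ", "ዐ": "አ", "ፀ": "ጸ"}

# str.maketrans turns the mapping into a lookup table keyed by code point.
TRANSLATION_TABLE = str.maketrans(CHAR_MAP)

def normalize_chars(token):
    # One pass over the token, whatever the number of mappings.
    return token.translate(TRANSLATION_TABLE)

print(normalize_chars("ሠላም"), normalize_chars("ዓለም"))
```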

In [14]:
normalized_data = []

for token in processed_tokenized_data:
    # map both forms of the full stop to the single sign '።'
    if token in ["።", "፡፡"]:
        normalized_data.append('።')
    # For all other tokens, apply the normalize function
    else:
        normalized_data.append(normalize(token))

In [15]:
print("Number of tokens after normalization: ", len(normalized_data))

Number of tokens after normalization:  82693302


In [16]:
# print processed_tokenized_data vs normalized_data
print("processed_tokenized\t|normalized")
print("-" * 40)
for i in range(100):
    if processed_tokenized_data[i] != normalized_data[i]:
        print("{:25} {:25}".format(processed_tokenized_data[i], normalized_data[i]))

processed_tokenized	|normalized
----------------------------------------
፡፡                        ።                        
፡፡                        ።                        
፡፡                        ።                        


In [17]:
del processed_tokenized_data
gc.collect()

32

## #1. N-gram language model
### 1.1. Create n-grams for n=1, 2, 3, 4.

In [18]:
# define a function to create n-grams
def create_n_grams(tokens: list, n: int):
    # slide a window of size n over the tokens; each n-gram is a list of n tokens
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]
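
With roughly 82 million tokens, keeping four full n-gram lists in memory is costly. A minimal sketch of a lazy alternative follows, assuming the n-grams are only needed for counting; `iter_n_grams` is a hypothetical name, not a function from the notebook.

```python
from collections import Counter
from itertools import islice

def iter_n_grams(tokens, n):
    """Yield n-grams as tuples without building an intermediate list."""
    # zip stops at the shortest iterator, so len(tokens) - n + 1 tuples are produced.
    return zip(*(islice(tokens, i, None) for i in range(n)))

toy_tokens = ["ምን", "መሰላችሁ", "?", "ኢትዮጵያ"]
print(list(iter_n_grams(toy_tokens, 2)))

# The iterator can be fed straight into a Counter.
bigram_counts = Counter(iter_n_grams(toy_tokens, 2))
```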

#### Unigrams (n=1)

In [19]:
# create n-grams for n=1
unigrams = create_n_grams(normalized_data, 1)

In [20]:
# print 10 unigrams (skip punctuation marks)
print("Unigrams:")
([unigram for unigram in unigrams[:100] if unigram[0] not in puncs][:10])

Unigrams:


[['ምን'],
 ['መሰላችሁ'],
 ['አንባቢያን'],
 ['ኢትዮጵያ'],
 ['በተደጋጋሚ'],
 ['ጥሪው'],
 ['ደርሷት'],
 ['ልትታደመው'],
 ['ያልቻለችው'],
 ['የአለም']]

#### Bigrams (n=2)

In [21]:
# create n-grams for n=2
bigrams = create_n_grams(normalized_data, 2)

In [22]:
# # Print 10 bigrams (skip punctuation marks)
# print("Bigrams:")
# ([bigram for bigram in bigrams[:100] if bigram[0] not in puncs and bigram[1] not in puncs][:10])

#### Trigrams (n=3)

In [23]:
# create n-grams for n=3
trigrams = create_n_grams(normalized_data, 3)

In [24]:
# print 10 trigrams (skip punctuation marks)
print("Trigrams:")
([trigram for trigram in trigrams[:100] if trigram[0] not in puncs and trigram[1] not in puncs and trigram[2] not in puncs][:10])

Trigrams:


[['አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ'],
 ['ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው'],
 ['በተደጋጋሚ', 'ጥሪው', 'ደርሷት'],
 ['ጥሪው', 'ደርሷት', 'ልትታደመው'],
 ['ደርሷት', 'ልትታደመው', 'ያልቻለችው'],
 ['ልትታደመው', 'ያልቻለችው', 'የአለም'],
 ['ያልቻለችው', 'የአለም', 'የእግር'],
 ['የአለም', 'የእግር', 'ኳስ'],
 ['የእግር', 'ኳስ', 'ዋ'],
 ['ኳስ', 'ዋ', 'ለ19ኛ']]

#### Fourgrams (n=4)

In [25]:
# create n-grams for n=4
fourgrams = create_n_grams(normalized_data, 4)

In [26]:
# print 10 fourgrams (skip punctuation marks)
print("Fourgrams:")
([fourgram for fourgram in fourgrams[:100] if fourgram[0] not in puncs and fourgram[1] not in puncs and fourgram[2] not in puncs and fourgram[3] not in puncs][:10])

Fourgrams:


[['አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው'],
 ['ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው', 'ደርሷት'],
 ['በተደጋጋሚ', 'ጥሪው', 'ደርሷት', 'ልትታደመው'],
 ['ጥሪው', 'ደርሷት', 'ልትታደመው', 'ያልቻለችው'],
 ['ደርሷት', 'ልትታደመው', 'ያልቻለችው', 'የአለም'],
 ['ልትታደመው', 'ያልቻለችው', 'የአለም', 'የእግር'],
 ['ያልቻለችው', 'የአለም', 'የእግር', 'ኳስ'],
 ['የአለም', 'የእግር', 'ኳስ', 'ዋ'],
 ['የእግር', 'ኳስ', 'ዋ', 'ለ19ኛ'],
 ['ኳስ', 'ዋ', 'ለ19ኛ', 'ጊዜ']]

### 1.2. Calculate probabilities of n-grams and find the top 10 most likely n-grams for all n.
#### Probability of n-grams (n=1, 2, 3, 4)

In [27]:
from collections import Counter
def calculate_n_gram_probabilities(n_grams: list):
    """Relative frequency of each n-gram among all n-grams of the same order (a joint, not a conditional, estimate)."""
    n_gram_counts = Counter(tuple(n_gram) for n_gram in n_grams)
    total_n_grams = sum(n_gram_counts.values())

    n_gram_probabilities = {n_gram: count / total_n_grams for n_gram, count in n_gram_counts.items()}
    return n_gram_probabilities
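
The function above returns the relative frequency of each whole n-gram. Language models usually work with the conditional maximum-likelihood estimate instead, P(w_n | w_1..w_{n-1}) = count(w_1..w_n) / count(w_1..w_{n-1}). The sketch below shows that estimate on a toy token list; `conditional_probabilities` is a hypothetical helper and is not used elsewhere in this notebook.

```python
from collections import Counter

def conditional_probabilities(tokens, n):
    """MLE estimate of P(last word | first n-1 words) for every observed n-gram."""
    if n < 2:
        raise ValueError("for n = 1 the relative frequency is already the estimate")
    n_gram_counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # Count a context only where a next word exists, so the probabilities
    # of all continuations of one context sum to 1.
    context_counts = Counter()
    for n_gram, count in n_gram_counts.items():
        context_counts[n_gram[:-1]] += count
    return {n_gram: count / context_counts[n_gram[:-1]]
            for n_gram, count in n_gram_counts.items()}

toy = "a b a c a b".split()
probs = conditional_probabilities(toy, 2)
print(probs[("a", "b")], probs[("a", "c")])
```

In the toy sequence the context `a` is followed by `b` twice and by `c` once, so the two printed values are 2/3 and 1/3.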

In [28]:
# Calculate probabilities for each n-gram size
unigram_probabilities = calculate_n_gram_probabilities(unigrams)
bigram_probabilities = calculate_n_gram_probabilities(bigrams)
trigram_probabilities = calculate_n_gram_probabilities(trigrams)
fourgram_probabilities = calculate_n_gram_probabilities(fourgrams)

#### Top 10 most likely n-grams (n=1, 2, 3, 4)

In [29]:
# print the unigrams among the 5 most probable entries that are not punctuation
# (punctuation is filtered after slicing, so fewer than 5 rows may print)

print("\033[1m \033[92m {} \033[00m".format("\nTop 5 most likely unigrams"))
print("-" * 40)
for unigram, prob in sorted(unigram_probabilities.items(), key=lambda x: x[1], reverse=True)[:5]:
    if unigram[0] not in puncs:
        print("{:20} {:10.8f}".format(unigram[0], prob))


Top 5 most likely unigrams
----------------------------------------
ነው                   0.01410268
ላይ                   0.00912525
ውስጥ                  0.00419036


In [30]:
# print the bigrams among the 10 most probable entries that contain no punctuation
# (punctuation is filtered after slicing, so fewer than 10 rows may print)
print("\033[1m \033[92m {} \033[00m".format("\nTop 10 most likely bigrams"))
print("-" * 40)
for bigram, prob in sorted(bigram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10]:
    if bigram[0] not in puncs and bigram[1] not in puncs:
        print("{:20} {:10.8f}".format(bigram[0] + " " + bigram[1], prob))

Top 10 most likely bigrams
----------------------------------------
አ ም                  0.00126824


In [31]:
# print the trigrams among the 10 most probable entries that contain no punctuation
# (punctuation is filtered after slicing, so fewer than 10 rows may print)
print("\033[1m \033[92m {} \033[00m".format("\nTop 10 most likely trigrams"))
print("-" * 40)
for trigram, prob in sorted(trigram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10]:
    if trigram[0] not in puncs and trigram[1] not in puncs and trigram[2] not in puncs:
        print("{:20} {:10.8f}".format(trigram[0] + " " + trigram[1] + " " + trigram[2], prob))


Top 10 most likely trigrams
----------------------------------------
እ ኤ አ                0.00020262


In [32]:
# print the fourgrams among the 10 most probable entries that contain no punctuation
# (punctuation is filtered after slicing, so fewer than 10 rows may print)
print("\033[1m \033[92m {} \033[00m".format("\nTop 10 most likely fourgrams"))
print("-" * 40)
for fourgram, prob in sorted(fourgram_probabilities.items(), key=lambda x: x[1], reverse=True)[:10]:
    if fourgram[0] not in puncs and fourgram[1] not in puncs and fourgram[2] not in puncs and fourgram[3] not in puncs:
        print("{:20} {:10.8f}".format(fourgram[0] + " " + fourgram[1] + " " + fourgram[2] + " " + fourgram[3], prob))


Top 10 most likely fourgrams
----------------------------------------
አ ም ኢሳት ዜና           0.00011706
ቀን 2008 አ ም          0.00004671
ቀን 2007 አ ም          0.00004074
ቀን 2010 አ ም          0.00004027
ቀን 2011 አ ም          0.00003923
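
The cells above slice the top entries first and drop punctuation afterwards, which is why fewer rows than requested are printed. A minimal sketch of the reverse order (filter first, then take the k largest) is given below. `top_k_n_grams` and the toy probabilities are illustrative assumptions, not values from the corpus.

```python
import heapq

def top_k_n_grams(n_gram_probabilities, k, excluded):
    """Return the k most probable n-grams that contain no excluded token."""
    candidates = (
        (n_gram, prob)
        for n_gram, prob in n_gram_probabilities.items()
        if not any(token in excluded for token in n_gram)
    )
    # nlargest avoids sorting the full table when only k rows are needed.
    return heapq.nlargest(k, candidates, key=lambda item: item[1])

# Toy table with made-up probabilities.
toy_probabilities = {("።",): 0.4, ("ነው",): 0.3, ("ላይ",): 0.2, ("?",): 0.1}
for n_gram, prob in top_k_n_grams(toy_probabilities, 2, excluded={"።", "?"}):
    print("{:20} {:10.8f}".format(" ".join(n_gram), prob))
```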


### 1.3. Probability of a given sentence

In [33]:
# what is the probability of a given sentence using four-grams
sentence = normalize("ኢትዮጵያ ታሪካዊ ሀገር ናት")
fourgram_probabilities[tuple(sentence.split())]

1.2092878287513962e-08

In [34]:
# find the probability of a given sentence
def sentence_probability(sentence, n, n_gram_probabilities):
    """Multiply the stored probabilities of the sentence's n-grams (1 <= n <= 4)."""
    if n not in (1, 2, 3, 4):
        raise ValueError("n must be between 1 and 4")

    words = sentence.split()
    # slide a window of size n over the words of the sentence
    sentence_grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    total_probability = 1.0
    for gram in sentence_grams:
        # unseen n-grams fall back to a small constant (1e-10)
        total_probability *= n_gram_probabilities.get(gram, 1e-10)

    return total_probability

In [35]:
# calculate the probability of the sentence under each n-gram model
sentence = "ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው" #ኢትዮጵያ ታሪካዊ ሀገር ናት
unigram_probability = sentence_probability(sentence, 1, unigram_probabilities)
bigram_probability = sentence_probability(sentence, 2, bigram_probabilities)
trigram_probability = sentence_probability(sentence, 3, trigram_probabilities)
fourgram_probability = sentence_probability(sentence, 4, fourgram_probabilities)

In [36]:
print(f"Unigram probability of sentence '{sentence}': {unigram_probability}")
print(f"Bigram probability of sentence '{sentence}': {bigram_probability}")
print(f"Trigram probability of sentence '{sentence}': {trigram_probability}")
print(f"Fourgram probability of sentence '{sentence}': {fourgram_probability}")

Unigram probability of sentence 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው': 1.0478356707280566e-26
Bigram probability of sentence 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው': 8.725269465276984e-30
Trigram probability of sentence 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው': 1.414747765439349e-23
Fourgram probability of sentence 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው': 5.849508211065065e-16


In [37]:
# sort the probabilities in ascending order (without shadowing the built-in `sorted`)
sorted_probabilities = sorted([unigram_probability, bigram_probability, trigram_probability, fourgram_probability])
print(sorted_probabilities)

[8.725269465276984e-30, 1.0478356707280566e-26, 1.414747765439349e-23, 5.849508211065065e-16]
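
Products of many small probabilities shrink quickly (the values above are already near 1e-30 for a five-word sentence) and would underflow to 0.0 for long sentences. A common remedy is to sum logarithms instead. The sketch below is a hypothetical log-space variant of the scoring above; `sentence_log_probability`, its `unseen` parameter, and the toy table are assumptions for illustration.

```python
import math

def sentence_log_probability(sentence, n, n_gram_probabilities, unseen=1e-10):
    """Sum of log10 probabilities of the sentence's n-grams."""
    words = sentence.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Unseen n-grams fall back to the same kind of small constant as before.
    return sum(math.log10(n_gram_probabilities.get(gram, unseen)) for gram in grams)

toy_table = {("a",): 0.5, ("b",): 0.25}
print(sentence_log_probability("a b", 1, toy_table))
```

Scores are then compared directly in log space: a larger (less negative) value means a more probable sentence.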


### 1.4. Generate random sentences using the n-grams for n=1, 2, 3, 4

In [38]:
# generate random sentences using the n-grams for n=1, 2, 3, 4
def generate_random_sentence(n_gram_probabilities):
    sentence_end_markers = ['።', '?']
    # build the sampling population once rather than on every draw
    n_grams = list(n_gram_probabilities.keys())
    weights = list(n_gram_probabilities.values())

    while True:
        sentence = []
        end_marker = None
        while end_marker is None:
            # draw one n-gram in proportion to its probability
            word_tuple = random.choices(n_grams, weights=weights)[0]
            for word in word_tuple:
                if word in sentence_end_markers:
                    end_marker = word  # the rest of this n-gram is discarded
                    break
                sentence.append(word)

        # retry if the sentence would consist of an end marker only
        if sentence:
            return ' '.join(sentence + [end_marker])

In [39]:
generated_sentence = generate_random_sentence(unigram_probabilities)
print("\033[1m \033[92m {} \033[00m {}".format("\nUnigram sentence: \n", generated_sentence))

generated_sentence = generate_random_sentence(bigram_probabilities)
print("\033[1m \033[92m {} \033[00m {}".format("\nBigram sentence: \n", generated_sentence))

generated_sentence = generate_random_sentence(trigram_probabilities)
print("\033[1m \033[92m {} \033[00m {}".format("\nTrigram sentence: \n", generated_sentence))

generated_sentence = generate_random_sentence(fourgram_probabilities)
print("\033[1m \033[92m {} \033[00m {}".format("\nFourgram sentence: \n", generated_sentence))

Unigram sentence:
ስእል በጎ አጠራጣሪ ውስጥ  ።

Bigram sentence:
ስድስት የወርቅ አሉ ።

Trigram sentence:
ምንጮች ያስረዳሉ ።

Fourgram sentence:
ለማቋቋም አስበውና ቆርጠው የተነሱ ሙሉ ድጋፍ ይሻሉ እ በየስብሰባዎቹ በአብዛኛው የሚገኙት ሴቶች አይዋጥ አላቸው ።


#### Explanation:
* **Unigram sentence:** The unigram sentence is not coherent and does not make sense. The unigram model ignores context and uses only the probability of each individual word, so it is not suitable for generating sentences.
* **Bigram sentence:** The bigram sentence is more coherent than the unigram sentence, because every sampled bigram is a pair of words that occur next to each other in the corpus.
* **Trigram sentence:** The trigram sentence is more coherent than the bigram sentence, because every sampled trigram is a run of three consecutive corpus words.
* **Fourgram sentence:** The fourgram sentence is locally the most coherent, because every sampled fourgram is a run of four consecutive corpus words. The generator draws n-grams independently and concatenates them, so coherence is only guaranteed within each n-gram, not across the joins.
* **Conclusion:** In these samples, the higher the order of the n-gram model, the more coherent the generated text is, although one sample per model is only an indication. Higher-order models also require more data to train, so the order should be chosen based on the available data.
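
The generator in this section samples whole n-grams independently. A common alternative conditions each new word on the preceding n-1 words, which keeps the text consistent across the whole sentence. The following is a minimal sketch of that idea on a toy token list; `build_successors`, `generate`, and the `max_words` cap are assumptions introduced here, not part of the assignment code.

```python
import random
from collections import Counter, defaultdict

def build_successors(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        successors[context][tokens[i + n - 1]] += 1
    return successors

def generate(successors, start_context, end_markers=('።', '?'), max_words=30, rng=random):
    """Extend start_context one word at a time until an end marker appears."""
    sentence = list(start_context)
    context = tuple(start_context)
    while len(sentence) < max_words:
        options = successors.get(context)
        if not options:
            break  # this context was never followed by anything in the data
        word = rng.choices(list(options.keys()), weights=list(options.values()))[0]
        sentence.append(word)
        if word in end_markers:
            break
        # Slide the context window forward by one word.
        context = (context + (word,))[1:]
    return ' '.join(sentence)

toy_tokens = "a b c ።".split()
print(generate(build_successors(toy_tokens, 2), ("a",)))
```

For n = 1 the context is the empty tuple, so the same code reduces to drawing words independently by frequency.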

## #2. Evaluate these Language Models Using Intrinsic Evaluation Method
#### Calculate perplexity for n_gram language models for n=1, 2, 3, 4

In [40]:
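# define a function to calculate perplexity
def calculate_perplexity(ngrams, total_words):
    # adds -log2(count / total_words) once per distinct n-gram (not weighted by
    # its count), divides the sum by the token count, and returns 2 to that power

    model_entropy = 0.0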
    for ngram, count in ngrams.items():
        prob = count / total_words
        model_entropy += -math.log2(prob)

    model_entropy /= total_words
    return math.pow(2, model_entropy)

def evaluate_model(n, tokens):

    ngrams = collections.Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total_words = len(tokens)

    return calculate_perplexity(ngrams, total_words)

In [41]:
# Evaluating n-gram models for n = 1, 2, 3, 4
perplexities = []
for n in range(1, 5):
    perplexity = evaluate_model(n, normalized_data)
    perplexities.append(perplexity)

# Printing the results
for n, perplexity in enumerate(perplexities):
    print(f"n = {n + 1}, Perplexity: {perplexity}")

n = 1, Perplexity: 1.4661943165044653
n = 2, Perplexity: 186.01227838075573
n = 3, Perplexity: 31872.187005459826
n = 4, Perplexity: 386061.670096585


#### Explanation:
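* With this computation, perplexity rises with n, from about 1.47 for unigrams to about 386,062 for fourgrams. The values are computed on the training corpus itself from n-gram relative frequencies, with each distinct n-gram counted once, so they mainly reflect how quickly the number of distinct n-grams grows with n (data sparsity). Higher perplexity values imply higher uncertainty and poorer predictions made by the language model, which points to the challenge of estimating higher-order models reliably from the available data.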
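
For comparison, perplexity is conventionally defined on held-out text as PP = 2^(-(1/N) Σ log2 P(w_i | context)), using conditional probabilities. The sketch below computes it for a bigram model with add-one (Laplace) smoothing. The smoothing choice, the train/test split, and the name `bigram_perplexity` are assumptions for illustration and were not used to produce the numbers above.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed bigram model on held-out tokens."""
    if len(test_tokens) < 2:
        raise ValueError("need at least two test tokens to form a bigram")
    unigram_counts = Counter(train_tokens)
    bigram_counts = Counter(zip(train_tokens, train_tokens[1:]))
    # Simplification: the vocabulary includes the test words as well.
    vocabulary_size = len(set(train_tokens) | set(test_tokens))

    log_prob_sum = 0.0
    predictions = 0
    for previous, word in zip(test_tokens, test_tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting probability 0.
        prob = (bigram_counts[(previous, word)] + 1) / (unigram_counts[previous] + vocabulary_size)
        log_prob_sum += math.log2(prob)
        predictions += 1

    # PP = 2 ** (-(1/N) * sum of log2 P(w_i | w_{i-1}))
    return 2 ** (-log_prob_sum / predictions)

print(bigram_perplexity("a b a b".split(), "a b".split()))
```

In the toy call the single test bigram `a b` has smoothed probability (2 + 1) / (2 + 2) = 0.75, so the perplexity is 1 / 0.75.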

## #3. Evaluate these Language Models Using Extrinsic Evaluation Method

In [42]:
def evaluate_language_model(language_model, sentences, n):
    # `sentences` is expected to be a list of sentence strings; a bare string
    # would be iterated character by character
    total_probability = 1.0  # start at 1, the identity for multiplication

    for sentence in sentences:
        words = sentence.split()

        for i in range(n - 1, len(words)):
            phrase = tuple(words[i - (n-1):i + 1])

            if phrase in language_model:
                total_probability *= language_model[phrase]
            else:
                total_probability *= 1e-6  # or some small probability for unseen n-grams

    # return 0 if the total_probability is 1.0
    return total_probability if total_probability != 1.0 else 0

In [43]:
sentences1 = ["ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው", "ኢትዮጵያ ታሪካዊ ሀገር ናት ሀገር ኢትዮጵያ"]

sentences2 = ["ደርሷትልትታደመውጥሪው ", "ሀገርኢትዮጵያ ናትታሪካዊ ሀገርኢትዮጵያ", "ኢትዮጵያታሪካዊ ሀገርናትየህዝብበተደጋጋሚ"]

# sentences = [" ".join(normalized_data[-6:])]
sentences = "ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው"

# sentence = "ወሳኝ መሆኑን የተገነዘበው ወዳጄ ነው እንግዲህ"
eval_unigram_probability = evaluate_language_model(unigram_probabilities, sentences, 1)
eval_bigram_probability = evaluate_language_model(bigram_probabilities, sentences, 2)
eval_trigram_probability = evaluate_language_model(trigram_probabilities, sentences, 3)
eval_fourgram_probability = evaluate_language_model(fourgram_probabilities, sentences, 4)

In [44]:
# print the results
print("\nThe probability that the unigram model generates '{}' is: {}".format(sentences, eval_unigram_probability))

print("\nThe probability that the bigram model generates '{}' is: {}".format(sentences, eval_bigram_probability))

print("\nThe probability that the trigram model generates '{}' is: {}".format(sentences, eval_trigram_probability))

print("\nThe probability that the fourgram model generates '{}' is: {}".format(sentences, eval_fourgram_probability))


The probability that the unigram model generates 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው' is: 1.8020207048523145e-107

The probability that the bigram model generates 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው' is: 0

The probability that the trigram model generates 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው' is: 0

The probability that the fourgram model generates 'ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው' is: 0


In [45]:
end = time.time()

print("Total time of execution is {} seconds.".format(end-start))

Total time of execution is 4594.860609292984 seconds.


#### Explanation:
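* We calculated the probability of a test sentence under each n-gram model. In this run, `sentences` was passed to `evaluate_language_model` as a single string rather than a list, so the loop iterated over individual characters: for n = 2, 3, 4 no n-gram could be formed and the function returned 0, and the unigram value (about 1.8e-107) is a product over characters rather than words. The word-level probabilities of the same sentence are the ones reported in Section 1.3. More generally, the probability assigned depends on the sentence contents and how well they match patterns in the training data. Some sentences may receive higher probabilities from higher-n models, while others are better predicted by lower-n models, so no definitive claim can be made that probabilities consistently increase or decrease with n. For generated text, the higher the order of the n-gram model, the more coherent the sentence tends to be.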
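
Since `evaluate_language_model` loops with `for sentence in sentences`, the type of its argument matters: a list yields sentences, whereas a bare string yields single characters. A small hedged sketch of a guard that accepts both is shown below; `sentence_n_grams` is a hypothetical helper, not a function from this notebook.

```python
def sentence_n_grams(sentences, n):
    """Collect word n-grams from a list of sentences (a bare string is wrapped)."""
    if isinstance(sentences, str):
        sentences = [sentences]  # avoid iterating over single characters
    grams = []
    for sentence in sentences:
        words = sentence.split()
        grams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

print(sentence_n_grams("ኢትዮጵያ ታሪካዊ ሀገር ናት", 2))
```

The same n-grams are produced whether the sentence is passed on its own or inside a list.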

### Conclusion:
* I have dedicated considerable effort to completing this assignment, engaging in discussions with my peers, particularly Debela, to address the various aspects of the tasks. Our collaborative efforts have been beneficial in broadening our understanding, and we learned significantly throughout this process.