## Natural Language Processing (NLP) - Assignment 1
## Instructor:
* **Fantahun B. (Ph.D.)**
### Student:
* **Name:** Wesagn Dawit
* **ID:** GSR/5257/15
* **Program:** MSc in Artificial Intelligence
#### Introduction:
* This individual assignment uses one version of the General Purpose Amharic Corpus (GPAC). The task is to build **n-gram language models for n = 1, 2, 3, 4** and to evaluate them **intrinsically and extrinsically**. For intrinsic evaluation, **perplexity** is computed from the n-gram probabilities estimated on the corpus. Because the corpus is unlabeled, there is no labeled downstream task to use for extrinsic evaluation, so **text generation** and sentence scoring serve that purpose. Each model generates random sentences, which are judged on coherence, and each model assigns a probability to test sentences, where a higher probability means the model considers the sentence more likely. Higher-order models capture more context and tend to produce more coherent text, but they also need more training data, because most long n-grams occur only rarely. The value of n should therefore be chosen based on the available data, and I think that is why our instructor specifically asked us to create n-gram models for n = 1, 2, 3, 4.

## Initializations

In [1]:
# import the necessary libraries
import re
import math
import time
import random
import collections

## Load the data

In [2]:
# read GPAC.txt, strip each line, and keep only the first line as the working text
start = time.time()  # start the timer used to report the total execution time

with open('GPAC.txt', 'r', encoding='utf-8') as f:
    data = [line.strip() for line in f]
data = data[0]

In [3]:
# print the first 500 characters as a sample
print("Sample data:\n", data[:500])

Sample data:
 ምን መሰላችሁ? (አንባቢያን) ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው ያልቻለችው የአለም የእግር ኳስ ዋ ለ19ኛ ጊዜ በደቡብ አፍሪካ ሲጠጣ፣ በሩቅ እያየች አንጀቷ ባረረ ልክ በአመቱ በለስ ቀናትና ሌላ ዋ ልትታደም ሁለት ልጆቿን ወደ ደቡብ አፍሪካ ላከች፡፡6ኛው ቢግ ብራዘርስ አፍሪካ አብሮ የመኖር ውድድር በደቡብ አፍሪካ ተካሂዷል፡፡ ከተለያዩ 14 የአፍሪካ አገራት የተውጣጡ 26 ያህል ተሳታፊዎች የተካፈሉበት ይህ ውድድር፣ ግለሰቦች በፈታኝ ሁኔታ ውስጥ በማለፍ ብቃታቸውን የሚያስመሰክሩበት መሆኑን ሰምተናል፡፡ የሚገጥሟቸውን የተለያዩ ፈተናዎች በትእግስትና በጥበብ ማለፍ፣ ከሌሎች ጋር ተስማምቶ መዝለቅ፣ ችግሮችን በብልጠት መፍታት ወዘተ     በየጊዜው ከሚደረገው ቅነሳ ተርፈው ለ91 ቀናት ያህል በውድድሩ መቆየት የቻሉ ሁለት ተወዳዳሪዎች እያንዳንዳቸው 200 ሺህ ዶላር እንደሚሸለሙም


## Preprocessing
#### 1. Tokenize the corpus

In [4]:
# define a list of punctuation marks
special_chars = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']

amharic_chars = ['፡፡', "::", '፡', '።', '፣', '፤', '፥', '፦', '፧', '፨']

geez = ['፩', '፪', '፫', '፬', '፭', '፮', '፯', '፰', '፱', '፲', '፳', '፴', '፵', '፶', '፷', '፸', '፹', '፺', '፻']

puncs = list(set(special_chars + amharic_chars + geez))

In [5]:
# define a function to process the tokens
def processed_list(split_list: list):
    tokens = []
    i = 0
    while i < len(split_list):
        if (split_list[i] == ':' and i + 1 < len(split_list) and split_list[i + 1] == ':') or \
                (split_list[i] == '፡' and i + 1 < len(split_list) and split_list[i + 1] == '፡'):
            tokens.append('።')
            i += 2  # Skip the next character as it is part of a consecutive pair
        else:
            tokens.append(split_list[i])
            i += 1

    return tokens

In [6]:
# define a function to tokenize the text data
def amh_tokenizer(text: str):
    # Escape punctuations to ensure they are not interpreted as regex operators
    escaped_puncs = [re.escape(p) for p in puncs]

    # Create the regex pattern: words or any of the specified punctuations
    pattern = r'\w+|' + '|'.join(escaped_puncs)

    # Use re.findall to get all matches
    words = re.findall(pattern, text)

    # Remove empty strings and items with space only from the list
    split_list = [word for word in words if word != '' and word != ' ']

    return processed_list(split_list)


In [7]:
# tokenize the original data
tokenized_data = amh_tokenizer(data)

In [8]:
# print the first 1000 tokens
tokenized_data[:1000]

['ምን',
 'መሰላችሁ',
 '?',
 '(',
 'አንባቢያን',
 ')',
 'ኢትዮጵያ',
 'በተደጋጋሚ',
 'ጥሪው',
 'ደርሷት',
 'ልትታደመው',
 'ያልቻለችው',
 'የአለም',
 'የእግር',
 'ኳስ',
 'ዋ',
 'ለ19ኛ',
 'ጊዜ',
 'በደቡብ',
 'አፍሪካ',
 'ሲጠጣ',
 '፣',
 'በሩቅ',
 'እያየች',
 'አንጀቷ',
 'ባረረ',
 'ልክ',
 'በአመቱ',
 'በለስ',
 'ቀናትና',
 'ሌላ',
 'ዋ',
 'ልትታደም',
 'ሁለት',
 'ልጆቿን',
 'ወደ',
 'ደቡብ',
 'አፍሪካ',
 'ላከች',
 '፡፡',
 '6ኛው',
 'ቢግ',
 'ብራዘርስ',
 'አፍሪካ',
 'አብሮ',
 'የመኖር',
 'ውድድር',
 'በደቡብ',
 'አፍሪካ',
 'ተካሂዷል',
 '፡፡',
 'ከተለያዩ',
 '14',
 'የአፍሪካ',
 'አገራት',
 'የተውጣጡ',
 '26',
 'ያህል',
 'ተሳታፊዎች',
 'የተካፈሉበት',
 'ይህ',
 'ውድድር',
 '፣',
 'ግለሰቦች',
 'በፈታኝ',
 'ሁኔታ',
 'ውስጥ',
 'በማለፍ',
 'ብቃታቸውን',
 'የሚያስመሰክሩበት',
 'መሆኑን',
 'ሰምተናል',
 '፡፡',
 'የሚገጥሟቸውን',
 'የተለያዩ',
 'ፈተናዎች',
 'በትእግስትና',
 'በጥበብ',
 'ማለፍ',
 '፣',
 'ከሌሎች',
 'ጋር',
 'ተስማምቶ',
 'መዝለቅ',
 '፣',
 'ችግሮችን',
 'በብልጠት',
 'መፍታት',
 'ወዘተ',
 'በየጊዜው',
 'ከሚደረገው',
 'ቅነሳ',
 'ተርፈው',
 'ለ91',
 'ቀናት',
 'ያህል',
 'በውድድሩ',
 'መቆየት',
 'የቻሉ',
 'ሁለት',
 'ተወዳዳሪዎች',
 'እያንዳንዳቸው',
 '200',
 'ሺህ',
 'ዶላር',
 'እንደሚሸለሙም',
 'ሲናገር',
 'ነበር',
 '፡፡',
 'በዘንድሮው',
 'ውድድር',
 'አገራችን',
 'ዳኒ',
 'እና',


#### 2. Remove unnecessary characters from the tokenized data
* **Purpose:** Simplify corpus, reduce noise, enhance generalization
* **Approach:** Exclude punctuation marks to focus on essential content
* **Considerations:** Task-specific requirements may warrant preserving certain punctuation. Here the sentence-ending marks **'።', '፡፡', and '?'** are kept because sentence generation uses them as stop symbols.
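
One detail worth making explicit is that the Amharic full stop can reach this step either as `።` or as the two-character token `፡፡`, and the two are then counted as different tokens. The following is a minimal sketch, not part of the notebook, of mapping every variant to one canonical token while dropping other punctuation. The name `filter_tokens` and the small `punct` set are assumptions made for illustration.

```python
# Hypothetical helper for illustration; the notebook's own function is remove_puncs.
SENTENCE_END_VARIANTS = {"፡፡", "::", "።"}
KEEP = {"።", "?"}  # marks that survive the filtering

def filter_tokens(tokens, punctuation):
    """Drop punctuation tokens, keeping one canonical full stop and '?'."""
    cleaned = []
    for token in tokens:
        if token in SENTENCE_END_VARIANTS:
            token = "።"  # canonical Amharic full stop
        if token not in punctuation or token in KEEP:
            cleaned.append(token)
    return cleaned

sample = ["ላከች", "፡፡", "ውድድር", "፣", "ተካሂዷል", "።", "ምን", "?"]
punct = {"፡፡", "::", "።", "፣", "?", "፤"}
print(filter_tokens(sample, punct))
```

With a single canonical full stop, counts for sentence-final n-grams are not split between two spellings of the same mark.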

In [9]:
# remove punctuation marks from the tokenized data
def remove_puncs(tokens: list):
    processed_tokens = []
    for token in tokens:
        if token not in puncs or token == '።' or token == '፡፡' or token == '?':
            processed_tokens.append(token)

    return processed_tokens

processed_tokenized_data = remove_puncs(tokenized_data)

#### 3. Normalize the tokens

In [10]:
# normalize the tokens by replacing homophone characters and labialized sequences with one canonical form
def normalize(norm: str):
    replacements = {"ሃ": "ሀ", "ኅ": "ሀ", "ኃ": "ሀ", "ሐ": "ሀ", "ሓ": "ሀ", "ኻ": "ሀ", "ሑ": "ሁ", 
                    "ኁ": "ሁ", "ዅ": "ሁ", "ኂ": "ሂ", "ሒ": "ሂ", "ኺ": "ሂ", "ኌ": "ሄ", "ሔ": "ሄ", 
                    "ዄ": "ሄ", "ሕ": "ህ", "ኆ": "ሆ", "ሖ": "ሆ", "ኾ": "ሆ", "ሠ": "ሰ", "ሡ": "ሱ", 
                    "ሢ": "ሲ", "ሣ": "ሳ", "ሤ": "ሴ", "ሥ": "ስ", "ሦ": "ሶ", "ዓ": "አ", "ኣ": "አ", 
                    "ዐ": "አ", "ዑ": "ኡ", "ዒ": "ኢ", "ዔ": "ኤ", "ዕ": "እ", "ዖ": "ኦ", "ፀ": "ጸ", 
                    "ፁ": "ጹ", "ጺ": "ፂ", "ጻ": "ፃ", "ጼ": "ፄ", "ፅ": "ጽ", "ፆ": "ጾ"}

    for character, replacement in replacements.items():
        norm = norm.replace(character, replacement)

    # raw strings keep the regex escapes (e.g. \s) intact
    specific_patterns = [
        r'(ሉ[ዋአሃ])', r'(ሙ[ዋአሃ])', r'(ቱ[ዋአሃ])', r'(ሩ[ዋአሃ])', r'(ሱ[ዋአሃ])', r'(ሹ[ዋአሃ])', r'(ቁ[ዋአሃ])',
        r'(ቡ[ዋአሃ])', r'(ቹ[ዋአሃ])', r'(ሁ[ዋአሃ])', r'(ኑ[ዋአሃ])', r'(ኙ[ዋአሃ])', r'(ኩ[ዋአሃ])', r'(ዙ[ዋአሃ])',
        r'(ጉ[ዋአሃ])', r'(ዱ[ዋአሃ])', r'(ጡ[ዋአሃ])', r'(ጩ[ዋአሃ])', r'(ጹ[ዋአሃ])', r'(ፉ[ዋአሃ])', r'[ቊ]', r'[ኵ]',
        r'\s+'
    ]
    replacements_specific = [
        'ሏ', 'ሟ', 'ቷ', 'ሯ', 'ሷ', 'ሿ', 'ቋ',
        'ቧ', 'ቿ', 'ኋ', 'ኗ', 'ኟ', 'ኳ', 'ዟ',
        'ጓ', 'ዷ', 'ጧ', 'ጯ', 'ጿ', 'ፏ', 'ቁ', 'ኩ',
        ' '
    ]

    for pattern, replacement in zip(specific_patterns, replacements_specific):
        norm = re.sub(pattern, replacement, norm)


    return norm
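
The single-character part of this normalization can also be done in one pass over each token. The sketch below assumes a small subset of the homophone table and a hypothetical helper name, `normalize_chars`; it uses Python's built-in `str.maketrans` and `str.translate` instead of repeated `replace` calls.

```python
# Illustrative subset of the homophone table; the full table is in normalize().
HOMOPHONES = {"ሃ": "ሀ", "ሐ": "ሀ", "ሠ": "ሰ", "ሦ": "ሶ", "ዓ": "አ", "ፀ": "ጸ"}

# Build the translation table once; keys must be single characters.
TABLE = str.maketrans(HOMOPHONES)

def normalize_chars(token):
    """Replace every homophone character in one pass over the token."""
    return token.translate(TABLE)

for word in ["ሃኒ", "ለሦስት", "ያፀደቀለት"]:
    print(word, "->", normalize_chars(word))
```

The multi-character rules (for example a `ሉ` followed by `ዋ`) still need regular expressions, so this only covers the one-to-one replacements.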

In [11]:
normalized_data = [normalize(token) for token in processed_tokenized_data]

In [12]:
# print processed_tokenized_data vs normalized_data
print("processed_tokenized\t|normalized")
print("-" * 40)
for i in range(1000):
    if processed_tokenized_data[i] != normalized_data[i]:
        print("{:25} {:25}".format(processed_tokenized_data[i], normalized_data[i]))

processed_tokenized	|normalized
----------------------------------------
ሃኒ                        ሀኒ                       
ሃኒም                       ሀኒም                      
ሃኒ                        ሀኒ                       
ሃኒም                       ሀኒም                      
ሃኒም                       ሀኒም                      
ለሦስት                      ለሶስት                     
ሃኒ                        ሀኒ                       
ሃትሪክ                      ሀትሪክ                     
ሃኒ                        ሀኒ                       
ሃኒን                       ሀኒን                      
ያፀደቀለት                    ያጸደቀለት                   
ሃኒ                        ሀኒ                       
ሃኒ                        ሀኒ                       
ሃኒ                        ሀኒ                       
ሃኒ                        ሀኒ                       
ሃኒ                        ሀኒ                       
ሃኒ                        ሀኒ                       
ሃኒ                        ሀኒ               

## #1. N-gram language model
### 1.1. Create n-grams for n=1, 2, 3, 4.

In [13]:
# define a function to create n-grams
def create_n_grams(tokens: list, n: int):
    n_grams = []
    for i in range(len(tokens) - n + 1):
        n_grams.append(tokens[i:i + n])

    return n_grams

#### Unigrams (n=1)

In [14]:
# create n-grams for n=1
unigrams = create_n_grams(normalized_data, 1)

In [15]:
# print the first 5 unigrams
unigrams[:5]

[['ምን'], ['መሰላችሁ'], ['?'], ['አንባቢያን'], ['ኢትዮጵያ']]

#### Bigrams (n=2)

In [16]:
# create n-grams for n=2
bigrams = create_n_grams(normalized_data, 2)

In [17]:
# print the first 5 bigrams
bigrams[:5]

[['ምን', 'መሰላችሁ'],
 ['መሰላችሁ', '?'],
 ['?', 'አንባቢያን'],
 ['አንባቢያን', 'ኢትዮጵያ'],
 ['ኢትዮጵያ', 'በተደጋጋሚ']]

#### Trigrams (n=3)

In [18]:
# create n-grams for n=3
trigrams = create_n_grams(normalized_data, 3)

In [19]:
# print the first 5 trigrams
trigrams[:5]

[['ምን', 'መሰላችሁ', '?'],
 ['መሰላችሁ', '?', 'አንባቢያን'],
 ['?', 'አንባቢያን', 'ኢትዮጵያ'],
 ['አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ'],
 ['ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው']]

#### Fourgrams (n=4)

In [20]:
# create n-grams for n=4
fourgrams = create_n_grams(normalized_data, 4)

In [21]:
# print the first 5 fourgrams
fourgrams[:5]

[['ምን', 'መሰላችሁ', '?', 'አንባቢያን'],
 ['መሰላችሁ', '?', 'አንባቢያን', 'ኢትዮጵያ'],
 ['?', 'አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ'],
 ['አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው'],
 ['ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው', 'ደርሷት']]

### 1.2. Calculate probabilities of n-grams and find the top 10 most likely n-grams for all n.
#### Probability of n-grams (n=1, 2, 3, 4)

In [22]:
# calculate n-gram probabilities for n=1
def calculate_unigram_probabilities(unigrams: list):
    unigram_counts = {}
    for unigram in unigrams:
        unigram_tuple = tuple(unigram)  # Convert list to tuple
        if unigram_tuple in unigram_counts:
            unigram_counts[unigram_tuple] += 1
        else:
            unigram_counts[unigram_tuple] = 1

    unigram_probabilities = {}
    total_unigrams = len(unigrams)
    for unigram, count in unigram_counts.items():
        unigram_probabilities[unigram] = count / total_unigrams

    return unigram_probabilities

In [23]:
# calculate unigram probabilities
unigram_probabilities = list(calculate_unigram_probabilities(unigrams).items())

In [24]:
# calculate n-gram probabilities for n=2
def calculate_bigram_probabilities(bigrams: list):
    bigram_counts = {}
    for bigram in bigrams:
        bigram_tuple = tuple(bigram)  # Convert list to tuple
        if bigram_tuple in bigram_counts:
            bigram_counts[bigram_tuple] += 1
        else:
            bigram_counts[bigram_tuple] = 1

    bigram_probabilities = {}
    total_bigrams = len(bigrams)
    for bigram, count in bigram_counts.items():
        bigram_probabilities[bigram] = count / total_bigrams

    return bigram_probabilities

In [25]:
# calculate bigram probabilities
bigram_probabilities = list(calculate_bigram_probabilities(bigrams).items())

In [26]:
# calculate n-gram probabilities for n=3
def calculate_trigram_probabilities(trigrams: list):
    trigram_counts = {}
    for trigram in trigrams:
        trigram_tuple = tuple(trigram)  # Convert list to tuple
        if trigram_tuple in trigram_counts:
            trigram_counts[trigram_tuple] += 1
        else:
            trigram_counts[trigram_tuple] = 1

    trigram_probabilities = {}
    total_trigrams = len(trigrams)
    for trigram, count in trigram_counts.items():
        trigram_probabilities[trigram] = count / total_trigrams

    return trigram_probabilities

In [27]:
# calculate trigram probabilities
trigram_probabilities = list(calculate_trigram_probabilities(trigrams).items())

In [28]:
# calculate n-gram probabilities for n=4
def calculate_fourgram_probabilities(fourgrams: list):
    fourgram_counts = {}
    for fourgram in fourgrams:
        fourgram_tuple = tuple(fourgram)  # Convert list to tuple
        if fourgram_tuple in fourgram_counts:
            fourgram_counts[fourgram_tuple] += 1
        else:
            fourgram_counts[fourgram_tuple] = 1

    fourgram_probabilities = {}
    total_fourgrams = len(fourgrams)
    for fourgram, count in fourgram_counts.items():
        fourgram_probabilities[fourgram] = count / total_fourgrams

    return fourgram_probabilities

In [29]:
# calculate fourgram probabilities
fourgram_probabilities = list(calculate_fourgram_probabilities(fourgrams).items())
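
The four functions above compute the relative frequency of each n-gram, that is, its count divided by the total number of n-grams of that order. A language model normally also needs the conditional probability of a word given its preceding n-1 words, estimated as count(n-gram) / count(context). A minimal sketch of both steps on a toy token list, with hypothetical function names that are not used elsewhere in this notebook:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_probabilities(tokens, n):
    """Maximum-likelihood estimate of P(last word | previous n-1 words)."""
    grams = ngram_counts(tokens, n)
    if n == 1:
        total = sum(grams.values())
        return {gram: count / total for gram, count in grams.items()}
    # Count each context only where it is followed by a word
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    return {gram: count / contexts[gram[:-1]] for gram, count in grams.items()}

toy = "ነው ። ነው ? ነው ።".split()
probs = conditional_probabilities(toy, 2)
for gram, p in probs.items():
    print(gram, round(p, 3))
```

For every context, the conditional probabilities of its continuations sum to 1, which is what allows them to be chained into a sentence probability.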

In [30]:
# print the first 5 probabilities for all n

print("Unigram probabilities:\n")
unigram_probabilities[:5]

Unigram probabilities:



[(('ምን',), 0.0017841831978120792),
 (('መሰላችሁ',), 4.931475586740991e-05),
 (('?',), 0.005905496433072657),
 (('አንባቢያን',), 1.2008227703859256e-05),
 (('ኢትዮጵያ',), 0.0012402818308065628)]

In [31]:
print("Bigram probabilities:\n")
bigram_probabilities[:5]

Bigram probabilities:



[(('ምን', 'መሰላችሁ'), 1.4729125397956964e-05),
 (('መሰላችሁ', '?'), 2.1755087513074367e-05),
 (('?', 'አንባቢያን'), 4.837151198015423e-08),
 (('አንባቢያን', 'ኢትዮጵያ'), 3.627863398511567e-08),
 (('ኢትዮጵያ', 'በተደጋጋሚ'), 6.167367777469664e-07)]

In [32]:
print("Trigram probabilities:\n")
trigram_probabilities[:5]

Trigram probabilities:



[(('ምን', 'መሰላችሁ', '?'), 7.932928060677225e-06),
 (('መሰላችሁ', '?', 'አንባቢያን'), 1.2092878141276258e-08),
 (('?', 'አንባቢያን', 'ኢትዮጵያ'), 1.2092878141276258e-08),
 (('አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ'), 2.4185756282552517e-08),
 (('ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው'), 2.4185756282552517e-08)]

In [33]:
print("Fourgram probabilities:\n")
fourgram_probabilities[:5]

Fourgram probabilities:



[(('ምን', 'መሰላችሁ', '?', 'አንባቢያን'), 1.2092878287513962e-08),
 (('መሰላችሁ', '?', 'አንባቢያን', 'ኢትዮጵያ'), 1.2092878287513962e-08),
 (('?', 'አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ'), 1.2092878287513962e-08),
 (('አንባቢያን', 'ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው'), 2.4185756575027924e-08),
 (('ኢትዮጵያ', 'በተደጋጋሚ', 'ጥሪው', 'ደርሷት'), 2.4185756575027924e-08)]

#### Top 10 most likely n-grams (n=1, 2, 3, 4)

In [34]:
# find the top 10 most likely n-grams for n=1
sorted_up = sorted(unigram_probabilities, key=lambda x: x[1], reverse=True)

print("Top 10 most likely unigrams:")
for i in range(10):
    print("{:25} {:25}".format(sorted_up[i][0][0], sorted_up[i][1]))


Top 10 most likely unigrams:
።                              0.037662663416197846
፡፡                             0.016365279499904357
ነው                             0.014102677868638018
ላይ                             0.009125249346071583
?                              0.005905496433072657
ውስጥ                           0.0041903635677772305
ወደ                             0.004002416060251168
እና                            0.0038550401579078315
ጋር                             0.003525327843360276
ነበር                           0.0033258195446107595


In [35]:
# find the top 10 most likely n-grams for n=2
sorted_bp = sorted(bigram_probabilities, key=lambda x: x[1], reverse=True)

print("Top 10 most likely bigrams:")
for i in range(10):
    print("{:25} {:25}".format(sorted_bp[i][0][0] + " " + sorted_bp[i][0][1], sorted_bp[i][1]))


Top 10 most likely bigrams:
ነው ።                            0.00538327766114936
ነው ፡፡                          0.002615121145061073
ነበር ።                         0.0016460704598066536
? ?                           0.0012708647342545923
አ ም                           0.0012682405797296687
ናቸው ።                         0.0008217594312748502
ነው ?                          0.0007919263012610901
። ይህ                          0.0007865328776753028
ነበር ፡፡                        0.0006848438666150236
ነገር ግን                        0.0005411079187659953


In [36]:
# find the top 10 most likely n-grams for n=3
sorted_tp = sorted(trigram_probabilities, key=lambda x: x[1], reverse=True)

print("Top 10 most likely trigrams:")
for i in range(10):
    print("{:25} {:25}".format(sorted_tp[i][0][0] + " " + sorted_tp[i][0][1] + " " + sorted_tp[i][0][2], sorted_tp[i][1]))


Top 10 most likely trigrams:
ነው ? ?                         0.000263781950895659
። ነገር ግን                     0.00021120211673738984
እ ኤ አ                         0.0002026161732570837
ማለት ነው ።                     0.00018153828665683918
። ይሁን እንጂ                    0.00014940750943546818
ነው ። ይህ                      0.00014396571427189385
እንዴት ነው ?                    0.00013570627850140217
፡፡ ነገር ግን                     0.0001336021177048201
ብቻ ነው ።                       0.0001221380692268902
አ ም ኢሳት                      0.00011814741944026905


In [37]:
# find the top 10 most likely n-grams for n=4
sorted_fp = sorted(fourgram_probabilities, key=lambda x: x[1], reverse=True)

print("Top 10 most likely fourgrams:")
for i in range(10):
    print("{:25} {:25}".format(sorted_fp[i][0][0] + " " + sorted_fp[i][0][1] + " " + sorted_fp[i][0][2] + " " + sorted_fp[i][0][3], sorted_fp[i][1]))

Top 10 most likely fourgrams:
አ ም ኢሳት ዜና                   0.00011705906182313514
እንዴት ነው ? ?                  5.8735109842455314e-05
? ? ? ?                       5.455097395497548e-05
ቀን 2008 አ ም                  4.6714788824666435e-05
ለምንድን ነው ? ?                 4.6412466867478585e-05
ቀን 2007 አ ም                  4.0740906950634534e-05
ቀን 2010 አ ም                   4.026928469742149e-05
ቀን 2011 አ ም                   3.922929716469529e-05
ምንድን ነው ? ?                  3.8419074319431854e-05
ቀን 2012 አ ም                   3.488795385947778e-05


### 1.3. Probability of a given sentence

In [38]:
# find the probability of a given sentence
def sentence_probability(sentence, n):
    words = sentence.split()

    if n == 1:
        n_grams = unigram_probabilities
    elif n == 2:
        n_grams = bigram_probabilities
    elif n == 3:
        n_grams = trigram_probabilities
    elif n == 4:
        n_grams = fourgram_probabilities
    else:
        raise ValueError("n must be between 1 and 4")

    # Slide a window of n words over the sentence
    sentence_grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

    # Index the (n-gram, probability) pairs once so each lookup is O(1)
    lookup = dict(n_grams)

    total_probability = 1.0
    for gram in sentence_grams:
        # An unseen n-gram gets probability 0 (no smoothing), which zeroes the product
        total_probability *= lookup.get(gram, 0.0)

    return total_probability

In [39]:
# calculate the probability of a sentence; n is set to the sentence length (4 words), so the whole sentence is scored as one fourgram
sent = "ኢትዮጵያ ታሪካዊ ሀገር ናት"

probability = sentence_probability(sent, len(sent.split()))

print(f"The probability of the sentence '{sent}' is: {probability}")

The probability of the sentence 'ኢትዮጵያ ታሪካዊ ሀገር ናት' is: 1.2092878287513962e-08
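
A product of relative frequencies becomes zero as soon as one n-gram is unseen, and long products of small numbers can underflow. A common alternative, sketched below under stated assumptions, is to sum log-probabilities of conditional bigram estimates with add-one (Laplace) smoothing. This is a different technique from the one used above, and `train_bigram` and `sentence_log_probability` are hypothetical names.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Return unigram and bigram counts for a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_log_probability(sentence, unigrams, bigrams):
    """Sum of log2 P(w_i | w_{i-1}) with add-one smoothing."""
    vocab_size = len(unigrams)
    words = sentence.split()
    log_prob = 0.0
    for prev, word in zip(words, words[1:]):
        numerator = bigrams[(prev, word)] + 1       # add one to every bigram count
        denominator = unigrams[prev] + vocab_size   # keeps the distribution normalized
        log_prob += math.log2(numerator / denominator)
    return log_prob

toy_tokens = "ኢትዮጵያ ታሪካዊ ሀገር ናት ። ኢትዮጵያ ውብ ሀገር ናት ።".split()
uni, bi = train_bigram(toy_tokens)
seen = sentence_log_probability("ኢትዮጵያ ታሪካዊ ሀገር ናት", uni, bi)
unseen = sentence_log_probability("ኢትዮጵያ ናት ሀገር", uni, bi)
print(seen, unseen)
```

Both scores are finite: the word order that never occurs in the toy corpus receives a lower log-probability instead of an outright zero.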


### 1.4. Generate random sentences using the n-grams for n=1, 2, 3, 4

In [40]:
# generate random sentences using the n-grams for n=1, 2, 3, 4
def generate_random_sentence(n):
    if n == 1:
        n_grams = unigram_probabilities
    elif n == 2:
        n_grams = bigram_probabilities
    elif n == 3:
        n_grams = trigram_probabilities
    elif n == 4:
        n_grams = fourgram_probabilities
    else:
        raise ValueError("n must be between 1 and 4")

    # Build the sampling population and weights once, not on every iteration
    population = [" ".join(gram) for gram, _ in n_grams]
    weights = [prob for _, prob in n_grams]
    end_marks = ('።', '፡፡', '?')

    sentence = ""
    # keep appending independently sampled n-grams until the text ends
    # with a sentence-ending punctuation mark (።, ፡፡, ?)
    while True:
        sentence += random.choices(population, weights)[0] + " "

        # drop a sentence-ending mark at the very start of the sentence
        for mark in end_marks:
            if sentence.startswith(mark + " "):
                sentence = sentence[len(mark) + 1:]
                break

        if sentence.endswith(tuple(mark + " " for mark in end_marks)):
            break

    return sentence

In [41]:
# generate sentences for n_grams, n=1, 2, 3, 4
unigram_sentence = generate_random_sentence(1)
bigram_sentence = generate_random_sentence(2)
trigram_sentence = generate_random_sentence(3)
fourgram_sentence = generate_random_sentence(4)

print("\033[1m \033[92m {} \033[00m {}".format("\nUnigram sentence: \n", unigram_sentence))
print("\033[1m \033[92m {} \033[00m {}".format("\nBigram sentence: \n", bigram_sentence))
print("\033[1m \033[92m {} \033[00m {}".format("\nTrigram sentence: \n", trigram_sentence))
print("\033[1m \033[92m {} \033[00m {}".format("\nFourgram sentence: \n", fourgram_sentence))

[1m [92m 
Unigram sentence: 
 [00m አውቃለሁ አመታት የክርስቲያን እንቅስቃሴ ቀን መአዛ እየፈጠረ ክፍያ እየሄደች የገለጸውን እንዳለበት እስኪ ወይም የፖለቲካ ጎምቱ ተቋርጦ ይልቁንም ነበረች የለቀቁት የነፃነት ጀመር መስጠቱን ፓልሚራ ህሊናቸው በጐ አማራጭ ዛሬም አስከባሪ ባልተከፋፈለ የጊዜና ይገልፃሉ ጉዲት ለመግታት ብንጀምርስ ስደተኞች ህጉ ነን ታይቷል የአለም ። 
[1m [92m 
Bigram sentence: 
 [00m አይታጠቡም ። 
[1m [92m 
Trigram sentence: 
 [00m ጥቂት ቀናት አስቀድመው ይህ ተገቢውን ግንዛቤ በስተቀር ውጤቱ ገንዘብ ቤተሰቦቹ እንዳይጠይቁት መደረጉን ዉል በስራ መመሪያ ግን አሉበት ፡፡ 
[1m [92m 
Fourgram sentence: 
 [00m ተግተው በሚሰሩት በእነዚህ ወንድሞች መብት ተሟጋች ድርጅቶች በኢትዮጵያ ወያላው ጀመረ ፡፡ ታዲያ በፍራንስ ፉትቦል መጽሄትና ድረገጾች ፈጣንና ቀለጣፋ ምላሽ መስጠት ኢትዮጵያን ከውድቀት ለመታደግም ሆነ ጉልህ ፋይዳ ይኖረዋል ። 


#### Explanation:
* **Unigram sentence:** The unigram sentence is not coherent and does not make sense. Each word is sampled independently according to its frequency, so no context is taken into account. The unigram model is therefore not suitable for generating sentences.
* **Bigram sentence:** Each sampled unit is a pair of words that occurs together in the corpus, so every pair is locally plausible. This particular sample ended after a single bigram because the pair drawn happened to finish with `።`, so it is too short to judge coherence.
* **Trigram sentence:** Each sampled unit is a three-word sequence from the corpus, so the sentence consists of locally coherent three-word chunks. It tends to read better than bigram output, although the transitions between chunks are not constrained.
* **Fourgram sentence:** Each sampled unit is a four-word sequence from the corpus. The chunks are longer, so a larger share of the sentence reproduces word order actually seen in the data.
* **Conclusion:** The higher the order of the n-gram model, the more coherent the generated text tends to be, because longer spans are copied from the corpus. However, the higher the order, the more data is required to estimate the model. The value of n should therefore be chosen based on the available data. Note that the generator above samples whole n-grams independently rather than conditioning each word on the previous n-1 words, so coherence is guaranteed only within each chunk.
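
For comparison, an n-gram generator can also pick each new word conditioned on the previous n-1 words, so that every window of n consecutive words occurs in the corpus. The following is a minimal sketch of that idea on a toy token list; `build_successors` and `generate` are hypothetical names, and this is not the method used in the cells above.

```python
import random
from collections import Counter, defaultdict

END_MARKS = {"።", "፡፡", "?"}  # sentence-ending tokens kept during preprocessing

def build_successors(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        successors[context][tokens[i + n - 1]] += 1
    return successors

def generate(tokens, n, max_words=30, seed=None):
    """Generate text word by word, conditioning on the previous n-1 words."""
    rng = random.Random(seed)
    successors = build_successors(tokens, n)
    words = list(rng.choice(list(successors)))  # random starting context
    while len(words) < max_words:
        context = tuple(words[-(n - 1):]) if n > 1 else ()
        options = successors.get(context)
        if not options:  # this context was never followed by a word
            break
        next_word = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
        if next_word in END_MARKS:
            break
    return " ".join(words)

toy = "ነገር ግን ይህ ጥሩ ነው ። ነገር ግን ያ ጥሩ ነው ።".split()
print(generate(toy, 3, seed=0))
```

Building the successor table once also avoids rebuilding the sampling lists on every step, which matters on a corpus of this size.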

## #2. Evaluate these Language Models Using Intrinsic Evaluation Method
#### Calculate perplexity for n_gram language models for n=1, 2, 3, 4

In [42]:
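# define a function to calculate a perplexity-style score from n-gram counts
def calculate_perplexity(ngrams, total_words):
    # Add -log2 of the relative frequency of every distinct n-gram
    # (one term per n-gram type, not per occurrence)
    model_entropy = 0.0
    for ngram, count in ngrams.items():
        prob = count / total_words
        model_entropy += -math.log2(prob)

    # Average over the number of tokens and convert back from bits
    model_entropy /= total_words
    return math.pow(2, model_entropy)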

def evaluate_model(n, tokens):

    ngrams = collections.Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total_words = len(tokens)

    return calculate_perplexity(ngrams, total_words)

In [43]:
# Evaluating n-gram models for n = 1, 2, 3, 4
perplexities = []
for n in range(1, 5):
    perplexity = evaluate_model(n, normalized_data)
    perplexities.append(perplexity)

# Printing the results
for n, perplexity in enumerate(perplexities):
    print(f"n = {n + 1}, Perplexity: {perplexity}")

n = 1, Perplexity: 1.466194395820576
n = 2, Perplexity: 192.72106771103458
n = 3, Perplexity: 39371.4665688541
n = 4, Perplexity: 513550.1090238023


#### Explanation:
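* The computed values rise steeply with n, from about 1.47 for unigrams to about 513,550 for fourgrams. Higher perplexity implies higher uncertainty and poorer predictions by the language model. Here the increase mainly reflects data sparsity: the number of distinct n-grams grows quickly with n, and most higher-order n-grams occur only a few times, so each one receives a very small relative frequency. The score above adds one term per distinct n-gram and is computed on the training corpus itself, so it is not directly comparable with the standard per-token perplexity of a conditional model measured on held-out text.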
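
For reference, the usual definition of perplexity is 2 raised to the negative average log2 probability that the model assigns to each token of a held-out text. The sketch below applies that definition to an add-one smoothed bigram model on toy data; the function name `bigram_perplexity` and the smoothing choice are assumptions made for illustration, not the notebook's method.

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Per-token perplexity of an add-one smoothed bigram model on test_tokens."""
    unigrams = Counter(train_tokens)
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    vocab_size = len(set(train_tokens) | set(test_tokens))
    pairs = list(zip(test_tokens, test_tokens[1:]))
    if not pairs:
        raise ValueError("test_tokens must contain at least two tokens")
    log_sum = 0.0
    for prev, word in pairs:
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_sum += math.log2(p)
    # 2 ** (negative average log2 probability per predicted token)
    return 2 ** (-log_sum / len(pairs))

train = "ነገር ግን ይህ ጥሩ ነው ። ነገር ግን ያ ጥሩ ነው ።".split()
familiar = "ነገር ግን ይህ ጥሩ ነው ።".split()
scrambled = "። ጥሩ ግን ነው ይህ ነገር".split()
print(bigram_perplexity(train, familiar), bigram_perplexity(train, scrambled))
```

Text whose word order matches the training data gets the lower perplexity. On a real corpus the test tokens would come from a portion of the data that was not used for counting.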

## #3 Evaluate these Language Models Using Extrinsic Evaluation Method

In [44]:
# create a language model using the n-gram model for n=1, 2, 3, 4
def create_language_model(n, tokens):
    # tokens is not used; the model is built from the precomputed probabilities
    models = {
        1: unigram_probabilities,
        2: bigram_probabilities,
        3: trigram_probabilities,
        4: fourgram_probabilities,
    }
    if n not in models:
        raise ValueError("n must be between 1 and 4")

    # key: the n-gram as a space-separated string, value: its relative frequency
    return {" ".join(ngram): prob for ngram, prob in models[n]}

In [45]:
# create language models for n=1, 2, 3, 4
unigram_language_model = create_language_model(1, normalized_data)
bigram_language_model = create_language_model(2, normalized_data)
trigram_language_model = create_language_model(3, normalized_data)
fourgram_language_model = create_language_model(4, normalized_data)

In [46]:
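def evaluate_language_model(language_model, sentences):
    # Sum the model probability of every sentence prefix found in the model.
    # The keys of an n-gram model have exactly n words, so only the prefix of
    # length n can match: the score is the relative frequency of each
    # sentence's first n-gram.
    total_probability = 0.0

    for sentence in sentences:
        words = sentence.split()

        for i in range(len(words)):
            phrase = " ".join(words[:i + 1])  # first i + 1 words of the sentence
            total_probability += language_model.get(phrase, 0.0)

    return total_probability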


In [47]:
sentences = ["ኢትዮጵያ በተደጋጋሚ ጥሪው ደርሷት ልትታደመው"]
unigram_probability = evaluate_language_model(unigram_language_model, sentences)
bigram_probability = evaluate_language_model(bigram_language_model, sentences)
trigram_probability = evaluate_language_model(trigram_language_model, sentences)
fourgram_probability = evaluate_language_model(fourgram_language_model, sentences)

In [48]:
# print the results
print("Sentence Unigram probability: ", unigram_probability)
print("Sentence Bigram probability: ", bigram_probability)
print("Sentence Trigram probability: ", trigram_probability)
print("Sentence Fourgram probability: ", fourgram_probability)

Sentence Unigram probability:  0.0012402818308065628
Sentence Bigram probability:  6.167367777469664e-07
Sentence Trigram probability:  2.4185756282552517e-08
Sentence Fourgram probability:  2.4185756575027924e-08


In [49]:
# Display total execution time of the entire program
end = time.time()
print(f"Total execution time = {end - start} seconds.")

Total execution time = 7552.461581230164 seconds.


#### Explanation:
* I scored a test sentence under each n-gram model. The sentence is taken from the corpus, and the scoring function matches only the first n words of the sentence, so each score is the relative frequency of that opening n-gram: about 1.2e-3 for the unigram, 6.2e-7 for the bigram, and 2.4e-8 for both the trigram and the fourgram. For this sentence the scores fall as n grows because longer word sequences are rarer. In general, the assigned probabilities depend on the sentence contents and how well they match patterns in the training data. Some sentences may receive higher probabilities from higher-order models, while others could be better predicted by lower-order models. No definitive claim can be made that probabilities will consistently increase or decrease with n.

### Conclusion:
* I have dedicated considerable effort to completing this assignment and discussed its various tasks with my peers, particularly Debela. This collaboration broadened our understanding, and we learned a great deal throughout the process.