# English orthotactics using Statistical Language Models

In this assignment you will use SUBTLEX-US, a popular lexicon that provides frequency counts extracted from movie subtitles, to train type- and token-level Statistical Language Models of different n-gram sizes. You will then use these models to score English words according to their probability, generate likely English words, and score foreign words from different languages, retrieved from the NorthEuraLex dataset, to see which languages may be more closely related to English from a phylogenetic point of view. After solving each task, you are asked to comment on the results: when grading comments, partial credit is always available.

Make sure to import all libraries you might need to solve the tasks: I've only imported the libraries required to instantiate the language model.

## TASK 1
Fetch SUBTLEX-US and load it in a Pandas df (make sure to specify how to read the string 'null'!!). Show the first 10 rows and the total number of rows. Also provide the link where you found and accessed SUBTLEX-US at the bottom of this cell.


In [1]:
# load SUBTLEX-US HERE
import pandas as pd
subtlex_df = pd.read_csv("http://www.lexique.org/databases/SUBTLEX-US/SUBTLEXus74286wordstextversion.tsv", sep="\t")

subtlex_df = subtlex_df.dropna()
print("SUBTLEX-US:", "\n", subtlex_df[:10], "\n")
print("Total number of rows: ", len(subtlex_df))


# Downloaded from: http://www.lexique.org/databases/SUBTLEX-US/SUBTLEXus74286wordstextversion.tsv

SUBTLEX-US: 
   Word  FREQcount  CDcount  FREQlow  Cdlow   SUBTLWF  Lg10WF  SUBTLCD  Lg10CD
0  the    1501908     8388  1339811   8388  29449.18  6.1766   100.00  3.9237
1   to    1156570     8383  1138435   8380  22677.84  6.0632    99.94  3.9235
2    a    1041179     8382   976941   8380  20415.27  6.0175    99.93  3.9234
3  you    2134713     8381  1595028   8376  41857.12  6.3293    99.92  3.9233
4  and     682780     8379   515365   8374  13387.84  5.8343    99.89  3.9232
5   it     963712     8377   685089   8370  18896.31  5.9839    99.87  3.9231
6    s    1057301     8377  1052788   8373  20731.39  6.0242    99.87  3.9231
7   of     590439     8375   573021   8372  11577.24  5.7712    99.85  3.9230
8  for     351650     8374   332686   8370   6895.10  5.5461    99.83  3.9230
9    I    2038529     8372     5147    350  39971.16  6.3093    99.81  3.9229 

Total number of rows:  74285
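
A note on the `'null'` requirement in the task: by default `pd.read_csv` treats the string `null` as a missing value, so the word "null" is read as `NaN` and is then removed by `dropna()`. The sketch below shows one possible way around this on a two-row toy table (made-up values, not the real file). It assumes that it is acceptable to switch off the default missing-value markers for the whole file; `toy_tsv`, `default_df` and `fixed_df` are names introduced only for this illustration.

```python
import io
import pandas as pd

# Toy stand-in for the SUBTLEX-US file (illustrative values, not real counts)
toy_tsv = "Word\tFREQcount\nnull\t5\nthe\t10\n"

# Default behaviour: the string "null" is parsed as a missing value (NaN)
default_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t")

# keep_default_na=False disables the built-in NaN markers, so "null" stays a word;
# na_values=[""] still treats empty cells as missing
fixed_df = pd.read_csv(io.StringIO(toy_tsv), sep="\t",
                       keep_default_na=False, na_values=[""])

print(default_df["Word"].isna().sum())  # how many words were lost to NaN parsing
print(fixed_df["Word"].tolist())
```

With the second call, no row has to be dropped just because its word happens to be spelled like a missing-value marker.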


## TASK 2
Comment the code below completing the docstrings (make sure to specify the appropriate data type of every input argument) and adding inline comments wherever you find a hash (#). At the end of each docstring provide a short summary of what each method or function does (max 2 lines); inline comments should be short and ideally limited to a single line, but if you need to, you can make them multiline.
Note: I'm asking you to heavily comment the code to make sure you understand it: when you code, try to use variable names that give a clear indication of what happens, so as not to overload the code with comments.


In [2]:
import operator
import numpy as np
from collections import defaultdict

class LM(object):
    
    
    
    def __init__(self, df, ngram_size=2, bos='+', eos='#', k=0.01, 
                 key_col='Word', value_col='FREQcount', tokens=True):
        
        """
        :param df: pandas.DataFrame, the lexicon (here the SUBTLEX-US data) with one row per word
        :param ngram_size: int, the number of characters in each n-gram (default 2, so character bigrams)
        :param k: float, the constant added to every count for smoothing, to avoid zero probabilities
        :param bos: str, the beginning-of-sequence symbol
        :param eos: str, the end-of-sequence symbol
        :param key_col: str, the name of the df column that contains the words
        :param value_col: str, the name of the df column that contains the word frequencies
        :param tokens: bool, if True the model is token-based (n-grams weighted by word frequency), if False type-based
        
        This class creates an LM object that stores the data and the hyper-parameters, maps each word type to its
        frequency and builds the character vocabulary (including the eos symbol).
        """
        
        self.k = k
        self.ngram_size = ngram_size
        self.bos = bos
        self.eos = eos
        self.df = df
        self.tokens = tokens
        self.types2freq = self.map_types2frequency(key_col, value_col)
        self.vocab = self.get_vocab()
        self.vocab.add(self.eos)
        self.vocab_size = len(self.vocab)
        
    def map_types2frequency(self, key_col, value_col):
        
        """
        :param key_col: str, the name of the column in the df with all of the words
        :param value_col: str, the name of the column in the df with their frequencies
        :return: dict, mapping each word type to its frequency
        
        This method maps each word type to its frequency. The frequency counts (values) and the words (index) are
        first combined into a pandas Series, which is then converted to a dict
        """
        
        return pd.Series(self.df[value_col].values, index=self.df[key_col]).to_dict()
        
    def get_vocab(self):
        
        """
        :return: a set
        
        This method derives the lower-case characters which exist in all of the words in the lexicon. It creates a set
        in which each character that exists in the lexicon is only listed once.
        """
                
        return set(char for string in self.types2freq.keys() for char in string.lower())
        
    def update_counts(self):
        
        """
        :return: nothing gets returned but the dictionary "counts" is created
        
        This method creates a dictionary called counts which contains all of the ngrams of all of the words, along with 
        their frequencies
        """
        
        # r is used later to specify how many chars of bos are in string. if ngram is a unigram, r=1, else, r=ngram_size-1
        r = 1 if self.ngram_size == 1 else self.ngram_size - 1
        
        # counts is a dictionary: if ngram_size > 1 it is a defaultdict of dicts (history -> {target: count}), for unigrams a normal dictionary
        self.counts = defaultdict(dict) if self.ngram_size > 1 else dict()
        
        # For each word, ngrams are created and added to a new dictionary as keys, along with their frequencies as values
        for word_type, freq in self.types2freq.items():
            
            # if tokens = True, count=the value in the types2freq dict (the frequency), otherwise count is 1
            count = freq if self.tokens else 1
            
            # this creates the padded string: the beginning of sequence "+" once if the n-grams are unigrams or bigrams,
            # otherwise ngram_size-1 times, then the word in lower-case, then the end of sequence "#"
            string = self.bos*r + word_type.lower() + self.eos
            
            # idx is the position of the target character; starting at ngram_size-1 ensures that every target has a full history
            for idx in range(self.ngram_size-1, len(string)):
                
                ngram = self.get_ngram(string, idx)
                
                if self.ngram_size == 1:
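                    # counts is a plain dict for unigrams, so start from 0 the first time a unigram is seen
                    self.counts[ngram] = self.counts.get(ngram, 0) + count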
                else:
                    # it tries to add the frequency "count" to the value of the specific ngram, assuming its key (the ngram itself) already exists
                    # in the dict, but if it doesn't (KeyError detected), it gets added as a key to the dict with
                    # its count frequency as a value
                    try:
                        self.counts[ngram[0]][ngram[1]] += count
                    except KeyError:
                        self.counts[ngram[0]][ngram[1]] = count
                        
    def get_ngram(self, string, i):
        
        """
        :param string: str, the string consisting of bos, the lower-case word and eos which was created in update_counts
        :param i: int, the index of the target character in the string, can range between ngram_size-1 and len(string)-1
        :return: if ngram_size=1, returns the target character (str), else returns a tuple of history and target (ngram split in two parts)
        
        This method creates the ngram that ends at the target character string[i], including the ngram_size-1
        preceding characters as history. It returns the ngram (as a tuple of history and target if ngram_size>1)
        """
        
        if self.ngram_size == 1:
            # if the ngram is a unigram, returns the single character of the string that index i indicates
            return string[i]
        else:
            # Here, the ngram is set as a string consisting of ngram_size number of characters, so ngram_size-1 characters before string[i],
            # and string[i]. history is everything in the ngram before string[i]. target is string[i]. this returns history and target,
            # so basically the ngram but split up into history and target (here both have len=1 because ngram_size=2)
            ngram = string[i-(self.ngram_size-1):i+1]
            history = tuple(ngram[:-1])
            target = ngram[-1]
            return (history, target)
    
    def get_unigram_probability(self, ngram):
        
        """
        :param ngram: the unigram for which the probability should be calculated
        :return: a number that indicates the probability of the unigram "ngram"
        
        This method first derives the total number of all unigrams in the data (adding smoothing), then
        the count of the occurrence of the specific unigram in question and divides them to get unigram probability
        """
        
        # this is the total count of all unigrams (the sum of their frequencies, not the number of distinct unigrams),
        # plus k for every item in the vocab set (add-k smoothing, with the default k this adds 0.01 per character)
        tot = sum(list(self.counts.values())) + (self.vocab_size*self.k)
        
        # here, it tries to get the frequency of the specific unigram in question from the counts dictionary
        # and adds k for smoothing. If the unigram is not a key in counts (it was never seen in training),
        # the count is just k (the smoothing constant, 0.01 by default)
        try:
            ngram_count = self.counts[ngram] + self.k
        except KeyError:
            ngram_count = self.k
        
        return ngram_count/tot
    
    def get_ngram_probability(self, history, target):
        
        """
        :param history: tuple of str, the characters just before the ngram target, so ngram[:-1]
        :param target: str, the target of the ngram, ngram[-1]
        :return: float, the smoothed probability of the target given the history
        
        This method calculates the probability of the target given the history with add-k smoothing. If the target was
        never seen after the history, or the history was never seen at all, only the smoothing terms are used
        """
        
        try:
            # This is the total sum of all frequencies of the history of the ngrams and 1% of the number of items in the
            # vocab set for smoothing. The history is considered because the ngram probability is calculated
            # as the probability of the target given the history of the ngram
            ngram_tot = np.sum(list(self.counts[history].values())) + (self.vocab_size*self.k)
            # Here, it tries to get the frequencies of the specific ngram (both history & target) plus k for smoothing
            # if the ngram doesn't exist as key, there is a KeyError
            try:
                transition_count = self.counts[history][target] + self.k
            except KeyError:
                # KeyError because the ngram is too rare, k (0.01) is set as the ngram's frequency
                transition_count = self.k
        except KeyError:
            # Fallback if the history cannot be found as a key: transition_count is set to k and ngram_tot is the
            # number of items in the vocab set times k. Since counts is a defaultdict, an unseen history returns
            # an empty dict, so the try block above already produces these same values
            transition_count = self.k
            ngram_tot = self.vocab_size*self.k
        
        # return the probability of finding the target given the history
        return transition_count/ngram_tot 
    
    def perplexity(self, string):
        
        """
        :param string: str, the word (or sequence of characters) of which we want to calculate the perplexity
        :return: float, the perplexity of the string
        
        This method computes the perplexity of a given string by splitting the padded string into ngrams, 
        getting the probability of each ngram and then applying the steps of the perplexity formula
        """
        
        r = 1 if self.ngram_size == 1 else self.ngram_size - 1
        string = self.bos*r + string + self.eos
        

        probs = []
        for idx in range(self.ngram_size-1, len(string)):
            ngram = self.get_ngram(string, idx)
            # If the ngram is a unigram, its probability is added to the list probs with the method get_unigram_probability,
            # if it's not a unigram, the method get_ngram_probability is used
            if self.ngram_size == 1:
                probs.append(self.get_unigram_probability(ngram))
            else:
                probs.append(self.get_ngram_probability(ngram[0], ngram[1]))
                    
        # The base-2 logarithm of each ngram probability is calculated. This is the first step of the perplexity formula
        entropy = np.log2(probs)
        
        # This is the next step of the perplexity formula. The sum of all log-probabilities is divided by the number of
        # probabilities in probs, so the number of ngrams in the string (not ngram_size). It is also multiplied by -1
        # because log-probabilities are negative and the average entropy must be positive.
        avg_entropy = -1 * (sum(entropy) / len(entropy))
        
        # The final step in the formula for perplexity is 2 to the power of the average entropy (normally the negative,
        # but we have already multiplied it by -1 in the previous step)
        return pow(2.0, avg_entropy)
    
    def generate(self, limit=50, max_prob=True):
    
        """
        :param limit: int, the maximum number of characters to generate (default 50)
        :param max_prob: bool, if True, will always generate the character with the highest count given the current history,
        if False, will sample a random character in proportion to its count (more likely -> higher chance of being selected)
        :return: str, the generated word (without the bos padding)
        
        This method generates and returns a word one character at a time, either by always taking the likeliest target
        given its history or (if max_prob=False) by sampling; it stops when eos is generated or the limit is reached
        """
    
        i = 0
        r = 1 if self.ngram_size == 1 else self.ngram_size - 1
        # The word that is to be generated is started off as a list of bos characters (1 for unigrams, otherwise ngram_size-1)
        word = [self.bos]*r
        # The first "current" is just the bos, in general current is the history
        current = word[-(self.ngram_size-1):]
            
        while i < limit:

            characters = []
            probabilities = []
            # continuations is a dictionary with the unigrams and their counts or with the current history and their counts
            continuations = self.counts if self.ngram_size == 1 else self.counts[tuple(current)]
            if max_prob:
                # The continuation character with the highest count is selected as "new", to be appended to "word"
                new = sorted(
                    continuations.items(), key=operator.itemgetter(1), reverse=True)[0][0]
            else:
                # If the goal is not to select the character with the highest probability: first, tot is the sum of
                # all frequencies in continuations
                tot = sum(list(continuations.values()))
                # All of the characters (ngrams) in continuations are appended to the list characters, and
                # their probabilities (individual count / total frequencies of all chars in continuations) are appended to probabilities list
                for char, v in continuations.items():
                    characters.append(char)
                    probabilities.append(v/tot)
        
                # Generates a random new character, taking the probabilities into account (using probability distribution)
                new = np.random.choice(characters, size=1, p=probabilities)[0]
        
            # If the newly generated character is eos, the word is finished and returned without bos
            # If not, the new character is appended to word
            if new != self.eos: 
                word.append(new) 
            else: 
                return ''.join(word[self.ngram_size-1:])
        
            # "current" is the last ngram_size-1 characters of "word", so the history for the next ngram
            current = word[-(self.ngram_size-1):]   
            i += 1
    
        # When the limit has been reached, the word is returned without bos
        return ''.join(word[self.ngram_size-1:])

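To make the padding and the (history, target) split concrete, here is a standalone sketch of what `update_counts` and `get_ngram` do together for `ngram_size > 1`. The function `char_ngrams` is a hypothetical helper written for this illustration and is not part of the class; unlike `get_ngram`, it would return an empty history for unigrams instead of a bare character.

```python
def char_ngrams(word, ngram_size=2, bos="+", eos="#"):
    """Return the (history, target) pairs of a padded word (hypothetical helper)."""
    # Same padding rule as in the class: one bos for unigrams, else ngram_size - 1
    pad = 1 if ngram_size == 1 else ngram_size - 1
    string = bos * pad + word.lower() + eos
    pairs = []
    # idx points at the target; the history is the ngram_size - 1 characters before it
    for idx in range(ngram_size - 1, len(string)):
        history = tuple(string[idx - (ngram_size - 1):idx])
        target = string[idx]
        pairs.append((history, target))
    return pairs

trigrams = char_ngrams("the", ngram_size=3)
print(trigrams)
```

A word of length L therefore yields L + 1 n-grams when `ngram_size > 1`: one per letter plus one for the eos symbol.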

## TASK 3
Train the following language models:
- 3gram, type based
- 3gram, token based
- 4gram, type based
- 4gram, token based

For each language model, show the vocabulary size, the number of history ngrams as well as 5 sample history ngrams, and the perplexity of the word 'computational'. Keep default hyper-parameters of the language model as far as bos, eos, key_col, value_col, and k are concerned. 
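
For reference, the perplexity requested here follows from the code of `get_ngram_probability` and `perplexity` and can be written out as below. The symbols are introduced only for this note: $h$ is the history, $c$ the target character, $V$ the vocabulary (including eos), $k$ the smoothing constant, and $N$ the number of scored n-grams in the padded string (for `ngram_size > 1`, the word length plus one for eos).

```latex
P(c \mid h) = \frac{\mathrm{count}(h, c) + k}{\sum_{c'} \mathrm{count}(h, c') + k\,|V|}
\qquad
\mathrm{PP}(w) = 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 P(c_i \mid h_i)}
```

A lower perplexity means that, on average, each character of the word was less surprising to the model.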

In [4]:
# instantiate the 3gram, type-based LM here and show the required details
from random import randint

# Initiating the model
lm1 = LM(subtlex_df, ngram_size=3, tokens=False)

# Show the vocabulary size
print("Vocabulary size: ", lm1.vocab_size, "\n")

lm1.update_counts()
# Show number of history ngrams
print("Number of history ngrams: ", len(list(lm1.counts)), "\n")

# Show history states
for _ in range(5):
    history = randint(0, len(lm1.counts) - 1)  # randint includes its upper bound
    print("Ngrams and frequencies for history {}:".format(list(lm1.counts.items())[history][0]))
    print(list(lm1.counts.items())[history][1:], "\n")

# Find perplexity for "computational"
print("Perplexity for \"computational\": ", lm1.perplexity("computational"))

Vocabulary size:  27 

Number of history ngrams:  627 

Ngrams and frequencies for history ('d', 'c'):
({'a': 29, 'u': 8, 'h': 10, '#': 2, 'o': 10, 'r': 4, 'l': 4},) 

Ngrams and frequencies for history ('g', 's'):
({'#': 526, 't': 31, 'i': 3, 'h': 15, 'a': 4, 'k': 3, 'p': 2, 'o': 2, 'b': 1, 'g': 1, 'l': 2, 'm': 1, 'w': 1},) 

Ngrams and frequencies for history ('d', 'o'):
({'#': 91, 'n': 127, 'w': 173, 'i': 16, 'e': 26, 'o': 78, 'c': 80, 'g': 70, 'u': 84, 'l': 84, 'z': 17, 'm': 124, 'p': 35, 'r': 120, 's': 59, 'd': 22, 't': 30, 'v': 13, 'x': 12, 'a': 5, 'f': 12, 'k': 3, 'b': 12, 'j': 2, 'h': 2, 'y': 2},) 

Ngrams and frequencies for history ('t', 'a'):
({'k': 74, 'l': 502, 'y': 17, 'n': 432, 'r': 313, '#': 164, 'i': 247, 'b': 323, 't': 528, 'c': 148, 's': 137, 'u': 29, 'p': 78, 'g': 133, 'f': 22, 'x': 30, 'm': 115, 'd': 20, 'w': 16, 'h': 7, 'v': 16, 'j': 1, 'e': 8, 'o': 4, 'a': 3, 'z': 3},) 

Ngrams and frequencies for history ('r', 'b'):
({'a': 102, '#': 23, 'e': 81, 'i': 99, 'y': 9,

In [None]:
print(lm1.types2freq["null"])

In [4]:
# instantiate the 3gram, token-based LM here and show the required details

# Initiating the model
lm2 = LM(subtlex_df, ngram_size=3, tokens=True)

# Show the vocabulary size
print("Vocabulary size: ", lm2.vocab_size, "\n")

lm2.update_counts()
# Show number of history ngrams
print("Number of history ngrams: ", len(list(lm2.counts)), "\n")

# Show history states
for _ in range(5):
    history = randint(0, len(lm2.counts) - 1)  # randint includes its upper bound
    print("Ngrams and frequencies for history {}:".format(list(lm2.counts.items())[history][0]))
    print(list(lm2.counts.items())[history][1:], "\n")

# Find perplexity of "computational"
print("Perplexity for \"computational\": ", lm2.perplexity("computational"))

Vocabulary size:  27 

Number of history ngrams:  627 

Ngrams and frequencies for history ('t', 'u'):
({'r': 84694, 'a': 27408, 'f': 16585, 'p': 10424, 'c': 5061, 'l': 4588, 'n': 12074, 'd': 12760, 'e': 2342, 's': 1896, 'b': 3412, 'm': 3468, 't': 3388, 'i': 576, 'x': 376, '#': 259, 'g': 410, 'o': 388, 'k': 1, 'u': 1},) 

Ngrams and frequencies for history ('f', 'i'):
({'r': 66130, 'n': 114185, 'v': 14653, 'c': 37147, 'g': 28261, 'x': 7106, 't': 8366, 'e': 11748, 's': 9263, 'l': 17253, 'd': 3270, 'f': 2964, 'a': 1139, 'b': 582, '#': 82, 'j': 75, 'z': 73, 'q': 27, 'o': 8, 'k': 11, 'p': 2, 'm': 1, 'y': 2},) 

Ngrams and frequencies for history ('u', 'x'):
({'u': 401, '#': 638, 'e': 338, 'i': 123, 't': 15, 'o': 8},) 

Ngrams and frequencies for history ('e', 'h'):
({'i': 11268, 'o': 4266, '#': 3430, 'a': 4202, 'e': 2913, 'y': 197, 'u': 32, 'r': 13, 'n': 1},) 

Ngrams and frequencies for history ('b', 'h'):
({'o': 118, 'u': 17, 'i': 7, 'e': 5, 'a': 3},) 

Perplexity for "computational":  1

In [5]:
# instantiate the 4gram, type-based LM here and show the required details

# Initiating the model
lm3 = LM(subtlex_df, ngram_size=4, tokens=False)

# Show the vocabulary size
print("Vocabulary size: ", lm3.vocab_size, "\n")

lm3.update_counts()
# Show number of history ngrams
print("Number of history ngrams: ", len(list(lm3.counts)), "\n")

# Show history states
for _ in range(5):
    history = randint(0, len(lm3.counts) - 1)  # randint includes its upper bound
    print("Ngrams and frequencies for history {}:".format(list(lm3.counts.items())[history][0]))
    print(list(lm3.counts.items())[history][1:], "\n")

# Find perplexity of "computational"
print("Perplexity for \"computational\": ", lm3.perplexity("computational"))

Vocabulary size:  27 

Number of history ngrams:  7116 

Ngrams and frequencies for history ('k', 'm', 'o'):
({'b': 1},) 

Ngrams and frequencies for history ('u', 'f', 'y'):
({'i': 1},) 

Ngrams and frequencies for history ('e', 'r', 'z'):
({'e': 2, 'o': 1},) 

Ngrams and frequencies for history ('j', 'a', 't'):
({'#': 1, 'o': 1},) 

Ngrams and frequencies for history ('v', 'i', 'c'):
({'e': 23, 't': 36, 'i': 9, '#': 5, 'l': 2, 'a': 7, 's': 2, 'h': 4, 'u': 3, 'o': 1},) 

Perplexity for "computational":  3.2894573817848176


In [6]:
# instantiate the 4gram, token-based LM here and show the required details

# Initiating the model
lm4 = LM(subtlex_df, ngram_size=4, tokens=True)

# Show the vocabulary size
print("Vocabulary size: ", lm4.vocab_size, "\n")

lm4.update_counts()
# Show number of history ngrams
print("Number of history ngrams: ", len(list(lm4.counts)), "\n")

# Show history states
for _ in range(5):
    history = randint(0, len(lm4.counts) - 1)  # randint includes its upper bound
    print("Ngrams and frequencies for history {}:".format(list(lm4.counts.items())[history][0]))
    print(list(lm4.counts.items())[history][1:], "\n")

# Find perplexity of "computational"
print("Perplexity for \"computational\": ", lm4.perplexity("computational"))

Vocabulary size:  27 

Number of history ngrams:  7116 

Ngrams and frequencies for history ('+', 'w', 'h'):
({'a': 517754, 'o': 152776, 'e': 208189, 'y': 114704, 'i': 59790, '#': 147, 'u': 109, 's': 1},) 

Ngrams and frequencies for history ('p', 'a', 'd'):
({'#': 495, 's': 170, 'd': 643, 'r': 293, 'e': 294, 'l': 61, 'o': 18, 'i': 6},) 

Ngrams and frequencies for history ('d', 'o', 'p'):
({'e': 988, 't': 1019, 'a': 78, 'h': 108, 'p': 62, 'i': 8, '#': 5, 'l': 3, 'o': 1, 's': 1},) 

Ngrams and frequencies for history ('s', 'i', 'x'):
({'#': 10176, 't': 1341, 'e': 74, 'p': 30, 'i': 2},) 

Ngrams and frequencies for history ('+', 'w', 'o'):
({'u': 116905, 'r': 148019, 'n': 58305, 'm': 32473, 'w': 8181, 'k': 1460, 'o': 5290, 'l': 1702, 'e': 164, 'p': 95, '#': 58, 'b': 92, 'v': 44, 'a': 15, 'g': 4, 't': 3, 'd': 4},) 

Perplexity for "computational":  4.035607213308256
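
The cells above draw each sample history independently with `randint`, so the same history can be shown more than once. A possible alternative is `random.sample`, which draws without replacement. The sketch below uses a toy `toy_counts` dictionary with made-up values in the same shape as `LM.counts`; it is not taken from the trained models.

```python
import random

# Toy nested counts in the same shape as LM.counts: history -> {target: count}
toy_counts = {
    ("+", "+"): {"s": 3, "t": 2},
    ("+", "s"): {"t": 2, "a": 1},
    ("s", "t"): {"#": 1, "a": 1},
    ("t", "a"): {"n": 1},
    ("a", "n"): {"t": 1},
    ("n", "t"): {"#": 1},
}

# random.sample draws without replacement, so no history is shown twice
sampled_histories = random.sample(list(toy_counts), k=5)

for history in sampled_histories:
    print("Ngrams and frequencies for history {}: {}".format(history, toy_counts[history]))
```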


## TASK 4
For each of the four language models, generate the likeliest string, ensuring it doesn't exceed 50 characters. Return a Pandas df that looks like this:

| n-gram size | token-based | word |
|---|---|---|
| 3 | True | xyz |
| 4 | False | wxywz |
|...|...|...|

The table should have 4 entries, one for each language model. After generating the likeliest words given each LM, comment on the results (does the model generate real English words or not? If not, are they plausible words?) and explain why, in your opinion, these words appear to be very likely, comparing results across ngram sizes and training regime.


In [7]:
# generate most likely strings

lm1_string = lm1.generate()
lm2_string = lm2.generate()
lm3_string = lm3.generate()
lm4_string = lm4.generate()
    
generation_data = {"N-gram size": [3, 3, 4, 4], "Token-based": [False, True, False, True],
                            "Word":[lm1_string, lm2_string, lm3_string, lm4_string]}
generation_df = pd.DataFrame(generation_data)
generation_df.index = generation_df.index + 1
display(generation_df)

| | N-gram size | Token-based | Word |
|---|---|---|---|
| 1 | 3 | False | st |
| 2 | 3 | True | the |
| 3 | 4 | False | stant |
| 4 | 4 | True | the |


_Comment on the results here_.
Does the model generate real English words or not? 
    If not, are they plausible words? 
Explain why, in your opinion, these words appear to be very likely, comparing results across ngram sizes and training regime.

The first model, which is type-based with an n-gram size of 3, generates "st". This is not an English word and hardly a plausible one, because proper English words normally contain a vowel. I think it comes out as the likeliest string because generation is greedy: at each step the model appends the most frequent continuation of the previous two characters. The combination of s and t must be very frequent in English word types, both at the beginning of a sequence and at the end, so the trigrams "+st" and "st#" both have high counts. The trigram model only sees two characters of history, so it cannot notice that "st#" directly follows the beginning of the word.

The second model, which is token-based and also has an n-gram size of 3, generates "the". This is a real English word. Since "the" is a stop word and one of the most frequent entries in SUBTLEX-US (see the first rows in Task 1), it is not surprising that its n-grams receive very high counts in a token-based model.

The third model, which is type-based with an n-gram size of 4, generates "stant". This is not an English word either, but it is more plausible than the string generated by the first model. It also starts with the letters s and t, which supports my suspicion that this combination is very likely at the beginning of a sequence. The sequence "ant" often occurs at the end of English words, for example as a suffix in adjectives like "constant" or "arrogant". Since higher-order n-grams condition on more context and are therefore more constraining, it makes sense that "stant", although not an actual English word, is closer to English than the string generated by the type-based trigram model.

The fourth model also has an n-gram size of 4 but is token-based, and it also generates "the", a real English word. Here, the token-based models therefore produce better results for the likeliest sequence. This is probably because in a token-based model every n-gram is weighted by the frequency of the words it occurs in, whereas in a type-based model each word counts only once. N-grams that come from very frequent words thus receive much higher counts, which is probably why the token-based models reproduce a frequent real word.
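
To make the greedy step concrete: with `max_prob=True`, `generate` always appends the continuation with the highest count for the current history and stops as soon as that continuation is eos. The sketch below reproduces this on invented trigram counts. The numbers are made up to mimic the "st" case and are not SUBTLEX-US counts, `greedy_generate` is a hypothetical stand-in for `LM.generate`, and the sketch assumes `ngram_size > 1`.

```python
def greedy_generate(counts, ngram_size=3, bos="+", eos="#", limit=50):
    """Greedily follow the most frequent continuation of each history."""
    word = [bos] * (ngram_size - 1)
    for _ in range(limit):
        history = tuple(word[-(ngram_size - 1):])
        continuations = counts[history]
        # pick the target character with the highest count
        best = max(continuations, key=continuations.get)
        if best == eos:
            break
        word.append(best)
    # strip the bos padding before returning
    return "".join(word[ngram_size - 1:])

# Invented counts: eos narrowly beats every other continuation of ("s", "t")
toy_counts = {
    ("+", "+"): {"s": 9, "t": 7},
    ("+", "s"): {"t": 5, "a": 4},
    ("s", "t"): {"#": 6, "a": 5},
}

generated = greedy_generate(toy_counts)
print(generated)
```

Because eos has the single highest count after "st" in these toy counts, the string ends there even though the other continuations taken together could be more frequent.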


## TASK 5
For each of the four language models, find the word from SUBTLEX-US with the lowest perplexity separately for words consisting of 3, 8, and 13 letters. Return a Pandas df which looks like this:

| n-gram size | token-based | word | perplexity |
|---|---|---|---|
| 3 | True | xyz | 1.2345 |
| 4 | False | wxywz | 5.4321 |
|...|...|...|...|

Perplexity scores should be rounded to 4 decimal places. The table should have 12 entries (ngram_size=(3|4) * tokens=(True|False) * word_length(3|8|13)). After finding the least surprising words, comment on the results and identify interesting trends (e.g., frequent or infrequent n-grams appearing? differences depending on training regime (ngram size and types v. tokens)?).

In [8]:
# find words with lowest perplexity and return target df here

lm_list = [lm1, lm2, lm3, lm4]
word_lengths = [3, 8, 13]

rows = []
for lm_num, lm in enumerate(lm_list, start=1):
    # rebuild the n-gram counts before scoring
    lm.update_counts()
    
    for length in word_lengths:
        # perplexity of every SUBTLEX-US word with the requested number of letters
        scores = {word: lm.perplexity(word) for word in lm.types2freq if len(word) == length}
        # the least surprising word is the one with the lowest perplexity
        best_word = min(scores, key=scores.get)
        rows.append({"LM num": lm_num, "N-gram size": lm.ngram_size, "Token-based": lm.tokens,
                     "Word": best_word, "Perplexity": round(scores[best_word], 4)})

perplexity_df = pd.DataFrame(rows)
display(perplexity_df)

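| | LM num | N-gram size | Token-based | Word | Perplexity |
|---|---|---|---|---|---|
| 0 | 1 | 3 | False | ing | 3.2399 |
| 1 | 1 | 3 | False | mentions | 4.0539 |
| 2 | 1 | 3 | False | fractionating | 4.3841 |
| 3 | 2 | 3 | True | you | 2.4519 |
| 4 | 2 | 3 | True | anything | 3.682 |
| 5 | 2 | 3 | True | motherfucking | 4.9008 |
| 6 | 3 | 4 | False | man | 5.2905 |
| 7 | 3 | 4 | False | stations | 3.096 |
| 8 | 3 | 4 | False | sexualization | 2.7874 |
| 9 | 4 | 4 | True | you | 2.1969 |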


_Comment on the results here_.
Comment on the results and identify interesting trends (e.g., frequent or infrequent n-grams appearing? Differences depending on training regime (ngram size and types v. tokens)?).

The word with the lowest perplexity is "you" in LM4, the token-based fourgram model. "you" also has the lowest perplexity among the three-letter words in the token-based trigram model (LM2). This is not surprising: it is a very common English word and a stop word, so it has a very high frequency count. In the token-based models, the n-grams of a word are weighted by that frequency, which makes the n-grams of "you" highly probable.
Something notable among the three-letter words is the one with the lowest perplexity in the type-based trigram model (LM1): "ing". Although "ing" is not a real English word by itself, it must occur as an entry somewhere in the corpus, and because "-ing" is a very common suffix, its trigrams are highly probable. The table reflects this as well: several of the other words ("fractionating", "anything" and "motherfucking") end in "-ing".

Another notable observation is the word "fractionating", which has the lowest perplexity among the 13-letter words in LM1 (type-based trigram). The word does exist in English, but my intuition is that it is not very common, so I was surprised to see it here. Several factors can explain this. First, frequency always depends on the corpus, so similar words may be well represented in this dataset. Second, because the model uses trigrams, the word only needs to consist of common three-letter sequences (such as those in "-tion-" and "-ing"), which is easier to achieve than consisting of common fourgrams. Third, the model is type-based, so the word's own low frequency does not count against it: every word contributes its n-grams exactly once. Finally, thirteen-letter words are rare in general, so there are probably few of them in the corpus, which makes it more likely that a rare word built from frequent n-grams ends up with the lowest perplexity.

Finally, the token-based fourgram model (LM4) is the one with the lowest perplexities overall. This is probably because a larger n-gram size gives the model more context, so its predictions are more precise and the perplexity is lower. Also, as explained in the previous task, token-based models represent highly frequent words (e.g. stop words) better, because the frequencies of their n-grams increase considerably. It therefore makes sense that the words in this model have the lowest perplexities overall.
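
To make the link between n-gram frequency, training regime and perplexity concrete, here is a minimal sketch of a character n-gram model. It is not the language model class used in this notebook: the class `ToyCharLM`, the padding symbols `+` and `#`, the add-one smoothing and the toy frequency counts are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

class ToyCharLM:
    """Hypothetical character n-gram model (not the assignment's LM class)."""

    def __init__(self, n, word_freqs, use_tokens):
        self.n = n
        self.counts = defaultdict(Counter)  # context -> counts of the next character
        self.alphabet = {"#"}
        for word, freq in word_freqs.items():
            # token-based: weight by frequency; type-based: every word counts once
            weight = freq if use_tokens else 1
            for context, char in self._ngrams(word):
                self.counts[context][char] += weight
                self.alphabet.add(char)

    def _ngrams(self, word):
        # pad with n-1 start symbols and n-1 end symbols (an assumption)
        padded = "+" * (self.n - 1) + word + "#" * (self.n - 1)
        for i in range(self.n - 1, len(padded)):
            yield padded[i - self.n + 1:i], padded[i]

    def prob(self, context, char):
        # add-one smoothing keeps unseen n-grams at a non-zero probability
        seen = self.counts[context]
        return (seen[char] + 1) / (sum(seen.values()) + len(self.alphabet))

    def perplexity(self, word):
        log_probs = [math.log(self.prob(c, ch)) for c, ch in self._ngrams(word)]
        return math.exp(-sum(log_probs) / len(log_probs))

# Invented toy counts, only meant to contrast one frequent and two rare words
freqs = {"you": 1000, "young": 20, "zebra": 1}
type_lm = ToyCharLM(3, freqs, use_tokens=False)
token_lm = ToyCharLM(3, freqs, use_tokens=True)
print("type-based :", round(type_lm.perplexity("you"), 4))
print("token-based:", round(token_lm.perplexity("you"), 4))
```

In this toy setting, the frequent word "you" gets a lower perplexity under the token-based model than under the type-based one, which is the same pattern as in the table above.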

## TASK 6
For each language model, generate 10 strings based on the probabilities the language models have learned from data - so don't generate the most likely string! - and then retrieve the 20 nearest orthographic neighbors to each generated string using SUBTLEX as your reference vocabulary. Then, compute the average length of the generated strings and the average Levenshtein distance between each generated string and its 20 nearest orthographic neighbors for each model. You can use the old20 package available here: https://github.com/stephantul/old20. It provides a fast, off-the-shelf implementation: you should clone the repository on your local machine to run it. Return four lists with the words generated by each model and a Pandas df that looks like this:

| n-gram size | token-based | avg OLD20 | avg length |
|---|---|---|---|
| 3 | True | 100.12345 | 5.4321 |
| 4 | False | 50.9876 | 5.6789 |
|...|...|...|...|

Numerical values should be rounded to the fourth decimal place. The final table should have 4 entries, one for each LM. After generating results, comment on them (e.g., are there systematic differences along ngram size and/or training regime? What may they reflect? Are trends similar for length and OLD20? What does it mean for a generated string to have a high OLD20? How does this relate to LMs?).
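
The difference between generating the most likely string and sampling from the learned probabilities can be sketched as follows. This is only an illustration under stated assumptions: the bigram table `NEXT_CHAR_COUNTS` and its counts are invented, and this `generate` function is a hypothetical stand-in that merely mirrors the `max_prob` flag of the language model's `generate` method.

```python
import random

# Hypothetical bigram counts: context character -> {next character: count}.
# "+" marks the start of a word and "#" marks the end.
NEXT_CHAR_COUNTS = {
    "+": {"t": 6, "a": 3, "s": 1},
    "t": {"o": 5, "a": 3, "#": 2},
    "a": {"t": 4, "s": 3, "#": 3},
    "s": {"o": 4, "a": 2, "#": 4},
    "o": {"#": 7, "t": 2, "s": 1},
}

def generate(max_prob=False, max_len=20, rng=random):
    """Generate one string, either greedily or by sampling from the counts."""
    context, chars = "+", []
    while len(chars) < max_len:
        options = NEXT_CHAR_COUNTS[context]
        if max_prob:
            # greedy: always pick the likeliest next character
            nxt = max(options, key=options.get)
        else:
            # sampling: draw the next character proportionally to its count
            nxt = rng.choices(list(options), weights=list(options.values()))[0]
        if nxt == "#":
            break
        chars.append(nxt)
        context = nxt
    return "".join(chars)

print(generate(max_prob=True))  # to
rng = random.Random(0)  # seeded only to make the sketch reproducible
sampled = [generate(rng=rng) for _ in range(5)]
print(sampled)
```

Greedy generation returns the same string on every call, whereas sampling can return a different string each time, which is why the task asks for sampled strings.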



In [9]:
from old20 import old20, old_n

In [10]:
# First, generate 10 distinct strings per model with max_prob = False
def generate_unique(lm, n=10):
    """Sample from lm until n distinct strings have been collected."""
    strings = set()  # Making sure there are no duplicates by using a set
    while len(strings) < n:
        strings.add(lm.generate(max_prob = False))
    return list(strings)

# Generating the strings
lm1_strings = generate_unique(lm1)
lm2_strings = generate_unique(lm2)
lm3_strings = generate_unique(lm3)
lm4_strings = generate_unique(lm4)

# Calculating the average length of the generated strings, one entry per model
average_lengths = []
for strings in (lm1_strings, lm2_strings, lm3_strings, lm4_strings):
    totallength = sum(len(string) for string in strings)
    average_lengths.append(totallength / len(strings))

allwords = list(subtlex_df["Word"])

In [11]:
# Making sure there are no duplicates
allwordsnew = list(set(allwords))

# Computing OLD for all of the lists of strings
average_old = []
for i in range(4):
    if i == 0:
        old_lm1 = old_n(lm1_strings, allwordsnew, n=20)
        current = old_lm1
    elif i == 1:
        old_lm2 = old_n(lm2_strings, allwordsnew, n=20)
        current = old_lm2
    elif i == 2:
        old_lm3 = old_n(lm3_strings, allwordsnew, n=20)
        current = old_lm3
    elif i == 3:
        old_lm4 = old_n(lm4_strings, allwordsnew, n=20)
        current = old_lm4
    # Calculating the average
    total = sum(current)
    average_old.append(total/10)

In [12]:
# Make the DataFrame and print lists
model_name = [1, 2, 3, 4]
ngram_size = [3, 3, 4, 4]
token_based = [False, True, False, True]

old_dataframe = {"model_name": model_name, "ngram_size": ngram_size, "token_based": token_based,
                "avg OLD20": average_old, "avg length": average_lengths}

old_df = pd.DataFrame(old_dataframe)

display(old_df)

print("List of generated words for type-based 3-gram (LM1): ", "\n", lm1_strings, "\n")
print("List of generated words for token-based 3-gram (LM2): ", "\n", lm2_strings, "\n")
print("List of generated words for type-based 4-gram (LM3): ", "\n", lm3_strings, "\n")
print("List of generated words for token-based 4-gram (LM4): ", "\n", lm4_strings)

Unnamed: 0,model_name,ngram_size,token_based,avg OLD20,avg length
0,1,3,False,56.5,6.2
1,2,3,True,41.5,4.0
2,3,4,False,91.1,9.5
3,4,4,True,29.0,4.4


List of generated words for type-based 3-gram (LM1):  
 ['sunprilmakah', 'lomp', 'lerces', 'ps', 'stor', 'supgraism', 'aninate', 'per', 'arti', 'valgarwanic'] 

List of generated words for token-based 3-gram (LM2):  
 ['whe', 'i', 'nown', 'and', 'he', 'you', 'ple', 'fordaybefulead', 'birpst', 'a'] 

List of generated words for type-based 4-gram (LM3):  
 ['refindbradinermastenseropard', 'actions', 'kha', 'dragealed', 'pernal', 'noodeled', 'tra', 'cances', 'cated', 'coopentrumblinessing'] 

List of generated words for token-based 4-gram (LM4):  
 ['like', 'here', 'so', 'decidea', 'got', 'it', 'serts', 'buildies', 'only', 'lover']


_Comment on the results here_.
Are there systematic differences along ngram size and/or training regime? What may they reflect? Are trends similar for length and OLD20? What does it mean for a generated string to have a high OLD20? How does this relate to LMs?

In these results, the token-based models generate words with considerably lower OLD20 values than the type-based models with the same n-gram size. A generated string with a high OLD20 has a large average edit distance to its nearest neighbours in the corpus, so it is more unusual: few existing words resemble it, and its n-grams tend to be less common in the data. The higher OLD20 of the type-based models therefore shows that the strings they generate are less similar to typical words than the strings generated by the token-based models. This could again be because, in a type-based model, every distinct word counts once, so the n-grams of frequent words do not have the "advantage" of being weighted by those words' frequencies.
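
To illustrate what an OLD value measures, here is a minimal pure-Python sketch. It is not the implementation from the old20 package: the function names `levenshtein` and `old_n_score` and the tiny vocabulary are assumptions made for this example, and n is set to 5 only because the vocabulary is so small.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def old_n_score(word, vocabulary, n=20):
    """Mean edit distance from word to its n nearest neighbours in vocabulary."""
    distances = sorted(levenshtein(word, other) for other in vocabulary if other != word)
    nearest = distances[:n]
    return sum(nearest) / len(nearest)

# Tiny illustrative vocabulary; the notebook uses the SUBTLEX-US word list instead
vocabulary = ["like", "lake", "bike", "line", "live", "mike", "lick", "life"]
print(old_n_score("like", vocabulary, n=5))         # 1.0
print(old_n_score("valgarwanic", vocabulary, n=5))  # much larger
```

A word-like string such as "like" sits one edit away from many neighbours, while a long invented string such as "valgarwanic" is many edits away from everything in the vocabulary.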

The average length of the generated words also differs greatly between training regimes, with the token-based models producing shorter words (4.0 and 4.4 characters versus 6.2 and 9.5). Length and OLD20 appear to be correlated: in these results, the higher the OLD20, the longer the words. A possible reason is that the more closely the generated words resemble actual English words in the corpus (in other words, the lower the OLD20), the closer they probably are to the average word length in the corpus. It would be interesting to compute the average length of all words in the corpus to test this hypothesis. It is, however, the case that in English in general, three-letter words are the most frequent. This fits the assumption that the lower the OLD20, the more similar the words are to words in the corpus, because they are then also more similar in length.
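
The average corpus word length mentioned above could be computed along the following lines. This is a sketch with invented toy counts standing in for the `Word` and `FREQcount` columns of SUBTLEX-US; the function names are hypothetical. It contrasts the type-level mean (every word counts once) with the token-level mean (every word is weighted by its frequency).

```python
# Toy stand-in for two SUBTLEX-US columns (Word, FREQcount); the counts are invented
toy_lexicon = {"the": 1000, "you": 800, "and": 600, "anything": 50, "fractionating": 1}

def mean_length_types(lexicon):
    """Average word length when every distinct word counts once."""
    return sum(len(word) for word in lexicon) / len(lexicon)

def mean_length_tokens(lexicon):
    """Average word length when every word is weighted by its frequency."""
    total = sum(lexicon.values())
    return sum(len(word) * freq for word, freq in lexicon.items()) / total

print(round(mean_length_types(toy_lexicon), 4))
print(round(mean_length_tokens(toy_lexicon), 4))
```

Because short words tend to be the most frequent ones, the token-level mean comes out lower than the type-level mean in this toy example, which matches the shorter strings generated by the token-based models.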

The trends also depend on n-gram size, but not in the same direction for both training regimes: for the token-based models, the fourgram model yields a lower OLD20 than the trigram model (29.0 versus 41.5), whereas for the type-based models it yields a higher one (91.1 versus 56.5). This suggests that a larger n-gram size brings the generated words closer to the data in the corpus mainly when training is token-based. Of course, this in turn means that the higher the n-gram size, the less well the model generalises to other corpora, but for the purposes of this assignment the higher n-gram size turns out to be beneficial in the token-based setting.

## TASK 7

### 7a
Access the NorthEuraLex dataset and retrieve the words corresponding to the core concepts MOUTH, DREAM, SUN, APPLE, BRIDGE, MIRROR, SKY, FISH, ROOSTER, and SON for the following languages: Czech, Dutch, Finnish, Italian, and Basque. Store them in whatever file you like but arrange them in a Pandas df and show its first 10 lines. 


In [13]:
# Load the target words here
# I was not able to retrieve a dataset consisting of the different words in the different languages combined,
# so I decided to create my dataframe manually from the website.
# Accessed from: http://northeuralex.org/parameters
import pandas as pd

english = ['mouth', 'dream', 'sun', 'apple', 'bridge', 'mirror', 'sky', 'fish', 'rooster', 'son']
basque = ['aho', 'amets', 'eguzki', 'sagar', 'zubi', 'mirail', 'zeru', 'arrain', 'oilar', 'seme']
czech = ['pusa','sen', 'slunce', 'jablko', 'most', 'zrcadlo', 'nebe', 'ryba',  'kohout', 'syn']
dutch = ['mond', 'droom', 'zon', 'appel', 'brug', 'spiegel', 'hemel', 'vis', 'haan',  'zoon']
finnish = ['suu', 'uni', 'aurinko', 'omena', 'silta', 'peili', 'taivas', 'kala', 'kukko', 'poika']
italian = ['bocca', 'sogno', 'sole', 'mela', 'ponte', 'specchio', 'cielo', 'pesce', 'gallo', 'figlio']

languages = ["basque", "czech", "dutch", "finnish", "italian"]

word_list_languages = [basque, czech, dutch, finnish, italian]

core_concepts = [word.upper() for word in english]

nelex_data = {"Core Concepts":core_concepts, "English":english, "Basque":basque, "Czech":czech,
             "Dutch":dutch, "Finnish":finnish, "Italian":italian}

nelex_df = pd.DataFrame(nelex_data)
display(nelex_df[:10])

Unnamed: 0,Core Concepts,English,Basque,Czech,Dutch,Finnish,Italian
0,MOUTH,mouth,aho,pusa,mond,suu,bocca
1,DREAM,dream,amets,sen,droom,uni,sogno
2,SUN,sun,eguzki,slunce,zon,aurinko,sole
3,APPLE,apple,sagar,jablko,appel,omena,mela
4,BRIDGE,bridge,zubi,most,brug,silta,ponte
5,MIRROR,mirror,mirail,zrcadlo,spiegel,peili,specchio
6,SKY,sky,zeru,nebe,hemel,taivas,cielo
7,FISH,fish,arrain,ryba,vis,kala,pesce
8,ROOSTER,rooster,oilar,kohout,haan,kukko,gallo
9,SON,son,seme,syn,zoon,poika,figlio


### 7b
Use the four language models you trained to obtain the perplexity of each translation of the 10 target core concepts and average perplexity scores per language. You should return a table that looks like this:

| n-gram size | token-based | lang | perplexity |
|---|---|---|---|
| 3 | True | NED | 5.4321 |
| 4 | False | ITA | 5.6789 |
|...|...|...|...|

The df should have 20 lines (ngram_size=(3|4), tokens=(True|False), languages=(ITA|NED|CZE|BAS|FIN)), and perplexity scores should be rounded to the 4th decimal place.
Briefly comment on the results, with particular focus on the relation between average perplexity and proximity of each language to English from a phylogenetic point of view.


In [14]:
# Compute the average perplexity of English LMs on translations of core concepts in other languages.

# Average perplexity per language under each model
def average_perplexities(lm):
    """Return the mean perplexity of lm over the translations of each language."""
    averages = []
    for words in word_list_languages:
        totalperp = sum(lm.perplexity(word) for word in words)
        averages.append(totalperp / len(words))
    return averages

lm1_perplexities = average_perplexities(lm1)
lm2_perplexities = average_perplexities(lm2)
lm3_perplexities = average_perplexities(lm3)
lm4_perplexities = average_perplexities(lm4)
    
model_num = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4]
ngram_size = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
lang = ["BAS", "CZE", "NED", "FIN", "ITA"]
lang = lang*4
token_based = [False, False, False, False, False, True, True, True, True, True]
token_based = token_based*2

all_perplexities = lm1_perplexities + lm2_perplexities + lm3_perplexities + lm4_perplexities

all_perplexities = [round(num, 4) for num in all_perplexities]

perplexity_language_data = {"model_num": model_num, 
                            "ngram_size": ngram_size,
                            "token-based": token_based,
                            "lang": lang, 
                            "perplexity": all_perplexities}
perplexity_language_df = pd.DataFrame(perplexity_language_data)

display(perplexity_language_df)

Unnamed: 0,model_num,ngram_size,token-based,lang,perplexity
0,1,3,False,BAS,34.6961
1,1,3,False,CZE,32.743
2,1,3,False,NED,17.3842
3,1,3,False,FIN,32.0126
4,1,3,False,ITA,15.9396
5,2,3,True,BAS,134.6482
6,2,3,True,CZE,140.8291
7,2,3,True,NED,47.4565
8,2,3,True,FIN,160.7889
9,2,3,True,ITA,28.1594


_Comment on the results here_.
Briefly comment on the results, with particular focus on the relation between average perplexity and proximity of each language to English from a phylogenetic point of view.

In these results, the Dutch and Italian words clearly have lower average perplexities than the Basque, Czech and Finnish words. In the token-based models, this difference is especially large.

In LM1, which is type-based and uses trigrams, Italian and Dutch have relatively low average perplexities, whereas Finnish, Basque and Czech have higher ones. In the token-based trigram model LM2, Italian shows the lowest perplexity. In LM3, the type-based fourgram model, Dutch has the lowest perplexity by far and Czech has the highest. In LM4, which is token-based and has n-gram size 4, the differences in perplexity are very large and Czech is by far the highest.

These results can partly be explained with phylogenetics: Dutch is the closest of these languages to English because both are Germanic languages. They therefore share words with similar origins, which in turn have similar n-grams. The other languages are less similar to English in that respect. Basque is a language isolate with no known relatives. Italian is a Romance language, which after Dutch is closest to English in origin, but still not very similar. Finnish is a Uralic language and therefore unrelated to the Germanic languages, and Czech is a Slavic language.

Because of these phylogenetic distances, the letter combinations in the different languages often differ greatly from those of English, so their n-grams are improbable under the English models, which increases perplexity.
The most precise and constrained of the four models, LM4 (token-based, fourgram), reflects these phylogenetic differences most strongly.

### 7c
Find the strings with the highest perplexity under each language model, regardless of language and core concept. For each language model, compute the perplexity of each translation of the 10 core concepts and return the one with the highest perplexity for each LM. You should return a df like this:

| n-gram size | token-based | concept | lang | word | perplexity |
|---|---|---|---|---|---|
| 3 | True | APPLE | ITA | mela | 5.4321 |
| 4 | False | SKY | NED | hemel | 5.6789 |
|...|...|...|...|...|...|

The words and languages in the example table are for illustrative purpose. The table should contain 4 entries, one for each LM. Briefly comment on the results, highlighting what in the most surprising words may be responsible for the high perplexity and why, if so, different LMs yield different results (considering ngram size and training regime).



In [15]:
# Find words with highest perplexity here.
def perplexity_to_word(lm):
    """Map the perplexity of every translation under lm to the word itself."""
    return {lm.perplexity(word): word
            for words in word_list_languages
            for word in words}

# One perplexity -> word dictionary per language model
lm1_perplexities = perplexity_to_word(lm1)
biggest_1 = max(lm1_perplexities)

lm2_perplexities = perplexity_to_word(lm2)
biggest_2 = max(lm2_perplexities)

lm3_perplexities = perplexity_to_word(lm3)
biggest_3 = max(lm3_perplexities)

lm4_perplexities = perplexity_to_word(lm4)
biggest_4 = max(lm4_perplexities)

all_biggest = [biggest_1, biggest_2, biggest_3, biggest_4]

language_abbreviations = ["BAS", "CZE", "NED", "FIN", "ITA"]
model_number = [1, 2, 3, 4]
ngram_size = [3, 3, 4, 4]
token_based = [False, True, False, True]
perplex_words = []
langs = []
concepts = []


for i in range(4):
    if i == 0:
        perplex_word = lm1_perplexities[biggest_1]
    elif i == 1:
        perplex_word = lm2_perplexities[biggest_2]
    elif i == 2:
        perplex_word = lm3_perplexities[biggest_3]
    elif i == 3:
        perplex_word = lm4_perplexities[biggest_4]
    for j in range(len(word_list_languages)):
        for k in range(len(word_list_languages[j])):
            if word_list_languages[j][k] == perplex_word:
                which_language = j
                which_word = k
                
    perplex_words.append(perplex_word)
    langs.append(language_abbreviations[which_language])
    concepts.append(core_concepts[which_word])
    
biggest_perplexities_data = {"model_number": model_number, "ngram_size": ngram_size,"token_based":  token_based, 
                             "concept": concepts, "lang": langs,"word": perplex_words, "perplexity": all_biggest}

biggest_perplexities_df = pd.DataFrame(biggest_perplexities_data)

display(biggest_perplexities_df)
    

Unnamed: 0,model_number,ngram_size,token_based,concept,lang,word,perplexity
0,1,3,False,MIRROR,CZE,zrcadlo,121.145381
1,2,3,True,MIRROR,CZE,zrcadlo,786.144338
2,3,4,False,SKY,ITA,cielo,264.937257
3,4,4,True,SKY,ITA,cielo,2709.019126


_Comment on the results here_.
Briefly comment on the results highlighting what in the most surprising words may be responsible for the high perplexity and why, if so, different LMs yield different results (considering ngram size and training regime)

I think the Czech word "zrcadlo" is so surprising to the models mainly because its sequences of consecutive consonants are very different from the n-grams usually found in English words. For example, the trigrams "+zr", "rca" and "dlo" are probably not very common in the corpus. Furthermore, word endings such as "lo#" and "o##" are not very common in English. This might be why both "zrcadlo" and "cielo", which both end in "lo", have high perplexity.
The n-grams of the Italian word "cielo" are also not very common in English, for example its combinations of vowels.

The words again have higher perplexities in the token-based models than in the type-based models (786.14 versus 121.15 for "zrcadlo" and 2709.02 versus 264.94 for "cielo"). This is probably because in token-based models frequent words make their n-grams even more frequent, so the foreign words "stand out" and surprise the model even more: their n-grams are even less probable relative to those of the English words.

The two n-gram sizes may select different words because a word like "cielo" might not be so surprising when only trigrams are considered, but is surprising with fourgrams: letter combinations like "ciel" are not very common in English.
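
The effect of n-gram size on a word like "cielo" can be illustrated with a small sketch that lists the padded n-grams of a word and checks which of them never occur in a reference vocabulary. The helper names, the padding scheme (n-1 `+` symbols at the start and n-1 `#` symbols at the end) and the hand-picked English sample are assumptions; a real check would use the full SUBTLEX-US word list.

```python
def padded_ngrams(word, n):
    """Character n-grams of word, padded with n-1 '+' at the start and n-1 '#' at the end."""
    padded = "+" * (n - 1) + word + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def unseen_ngrams(word, n, vocabulary):
    """N-grams of word that occur in no word of vocabulary."""
    seen = {gram for other in vocabulary for gram in padded_ngrams(other, n)}
    return [gram for gram in padded_ngrams(word, n) if gram not in seen]

# Small hand-picked English sample, for illustration only
english_sample = ["science", "client", "hello", "cello", "radio", "most", "sole", "cadet"]

print(padded_ngrams("cielo", 3))
print(unseen_ngrams("cielo", 3, english_sample))
print(unseen_ngrams("cielo", 4, english_sample))
```

With this sample, trigrams such as "cie" and "lo#" are attested in English words, whereas fourgrams such as "ciel" are not, so more of the word is unfamiliar to a fourgram model than to a trigram model.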

### 7d
Find the strings with the lowest perplexity under each language model, regardless of language and core concept. For each language model, compute the perplexity of each translation of the 10 core concepts and return the one with the lowest perplexity for each LM. You should return a df like this:

| n-gram size | token-based | concept | lang | word | perplexity |
|---|---|---|---|---|---|
| 3 | True | APPLE | ITA | mela | 5.4321 |
| 4 | False | SKY | NED | hemel | 5.6789 |
|...|...|...|...|...|...|

The words and languages in the example table are for illustrative purpose. The table should contain 4 entries, one for each LM and perplexity scores should be rounded to the 4th decimal place. Briefly comment the results, highlighting what may cause the resulting strings to have low perplexity under LMs trained on English.


In [16]:
# Find words with lowest perplexity here.
# Reuses the perplexity -> word dictionaries built for each language model in 7c.
smallest_1 = min(lm1_perplexities)
smallest_2 = min(lm2_perplexities)
smallest_3 = min(lm3_perplexities)
smallest_4 = min(lm4_perplexities)

all_smallest= [smallest_1, smallest_2, smallest_3, smallest_4]

non_perplex_words = []
langs = []
concepts = []
for i in range(4):
    if i == 0:
        non_perplex_word = lm1_perplexities[smallest_1]
    elif i == 1:
        non_perplex_word = lm2_perplexities[smallest_2]
    elif i == 2:
        non_perplex_word = lm3_perplexities[smallest_3]
    elif i == 3:
        non_perplex_word = lm4_perplexities[smallest_4]
    for j in range(len(word_list_languages)):
        for k in range(len(word_list_languages[j])):
            if word_list_languages[j][k] == non_perplex_word:
                which_language = j
                which_word = k
                
    non_perplex_words.append(non_perplex_word)
    langs.append(language_abbreviations[which_language])
    concepts.append(core_concepts[which_word])
    
smallest_perplexities_data = {"model_number": model_number, "ngram_size": ngram_size,"token_based":  token_based, 
                             "concept": concepts, "lang": langs,"word": non_perplex_words, "perplexity": all_smallest}

smallest_perplexities_df = pd.DataFrame(smallest_perplexities_data)

display(smallest_perplexities_df)

Unnamed: 0,model_number,ngram_size,token_based,concept,lang,word,perplexity
0,1,3,False,SUN,ITA,sole,7.775581
1,2,3,True,BRIDGE,CZE,most,7.158575
2,3,4,False,DREAM,NED,droom,6.672799
3,4,4,True,BRIDGE,CZE,most,5.094632


_Comment on the results here_.

In this table, the strings probably have low perplexity scores because they are very similar to possible English words, and some even are actual English words. For example, "most", which means bridge in Czech, is a word in English, as is "sole". "Droom" is not a word in English, but its n-grams are very likely, and it could be a word: an English speaker would not have any issues pronouncing it as if it were English. Its fourgrams also occur in English words, for example "droo" in "drool" and "room" in "room".
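
One way to check which English words contain a given n-gram of "droom" is sketched below. The helper `words_containing` and the small sample vocabulary are assumptions for illustration; in this notebook the vocabulary would be the SUBTLEX-US word list.

```python
def words_containing(ngram, vocabulary):
    """Return the vocabulary words that contain the given character sequence."""
    return [word for word in vocabulary if ngram in word]

# Small illustrative sample, not the real lexicon
sample_vocabulary = ["drool", "room", "broom", "drop", "door", "most", "sole"]

for gram in ["droo", "room", "oom"]:
    print(gram, words_containing(gram, sample_vocabulary))
```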