# IN4080: obligatory assignment 1 (Autumn 2025)
 
Mandatory assignment 1 consists of three parts. In Part 1 (6 points), you will test and improve a BPE (Byte-Pair Encoding) tokenizer. In Part 2 (7 points), you will estimate an N-gram language model, based on a training corpus and the tokenizer you worked on in Part 1. Finally, in Part 3 (7 points), you will develop a basic classification model to distinguish between Bokmål and Nynorsk sentences.

You should answer all three parts. You are required to get at least 12/20 points to pass. The most important thing is that you attempt each question (even if you make some mistakes), as this will help you gain a better and more concrete understanding of the topics covered during the lectures. There are also bonus questions for those of you who would like to deepen your understanding of the topics covered by this assignment.

- We assume that you have read and are familiar with IFI’s requirements and guidelines for mandatory assignments, see [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-mandatory.html) and [here](https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-guidelines.html).
- This is an individual assignment. You should not deliver joint submissions. 
- You may redeliver in Devilry before the deadline (__Friday, September 12 at 23:59__), but include all files in the last delivery.
- Only the last delivery will be read! If you deliver more than one file, put them into a zip-archive. You don't have to include in your delivery the files already provided for this assignment. 
- Name your submission _your\_username\_in4080\_mandatory1_
- You can work on this assignment either on the IFI machines or on your own computer. 

*The preferred format for the assignment is a completed version of this Jupyter notebook*, containing both your code and explanations about the steps you followed. We want to stress that simply submitting code is __not__ by itself sufficient to complete the assignment - we expect the notebook to also contain explanations of what you have implemented, along with motivations for the choices you made along the way. Preferably use whole sentences, and mathematical formulas if necessary. Explaining in your own words (using concepts we have covered in the lectures) what you have done and reflecting on your solution is an important part of the learning process - take it seriously!

Regarding the use of LLMs (ChatGPT or similar): you are allowed to use them as a 'sparring partner', for instance to clarify something you have not understood. However, you are __not__ allowed to use them to generate solutions (either in part or in full) to the assignment tasks. 

__Technical tip__: Some of the tasks in this assignment will require you to extend methods in classes that are already partly implemented. To implement those methods directly in a Jupyter notebook, you can use the function `setattr` to attach a method to a given class: 

```python
class A:
    pass
a = A()

def foo(self):
    print('hello world!')
    
setattr(A, 'foo', foo)
```
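
As a quick check of the tip above, the sketch below (using a hypothetical class `Greeter` and function `greet`, which are not part of the assignment code) shows that a method attached with `setattr` is also available on instances created before the call, because method lookup goes through the class:

```python
class Greeter:
    pass

# Instance created BEFORE the method is attached
g = Greeter()

def greet(self, name):
    """Return a greeting; `self` is the instance the method is called on."""
    return f"hello {name}!"

# Attach the function to the class under the name 'greet'
setattr(Greeter, "greet", greet)

message = g.greet("world")
print(message)
```

The same mechanism can be used to replace an existing method such as `__init__`.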

First, make sure that all required modules are installed:

In [1]:
%pip install tqdm numpy scikit_learn

Note: you may need to restart the kernel to use updated packages.


## Part 1 : Tokenisation

We will start by building a basic tokenizer relying on white space and punctuation. 

__Task 1.1__ (2 points): Implement the method `split` below such that it takes a text as input and outputs a list of tokens. The tokenisation should simply be done by splitting on white space, except for punctuation markers and other symbols (`.,:;!?-()"`), which should correspond to their own token. For instance, the sentence "Pierre, who works at NR, also teaches at UiO." should be split into 12 tokens.

In [1]:
from typing import List
import re

def basic_tokenize(text: str) -> List[str]:
    """The method should split the text on white space, except for punctuation
    markers that should be considered as tokens of their own (even in the 
    absence of white space before or after their occurrence)"""

    # Implement here your basic tokenisation
    
    return re.findall(r"\w+|[.,:;!?\-()\"]",text)

In [2]:
a = "Pierre, who works at NR, also teaches at UiO."
token_a = basic_tokenize(a)
len(token_a)

12

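Shu_explanation:
Since the regular expression package `re` is imported, we can use `re.findall` to extract the tokens. The pattern `\w+` captures words (runs of letters, digits and underscores), and the character class `[.,:;!?\-()\"]` captures each punctuation marker as a separate token.

Note: inside the character class, the hyphen must be escaped (`\-`); otherwise `?-(` is read as a character range and raises a “bad character range” error. The double quote is escaped (`\"`) only because the pattern string itself is delimited by double quotes.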
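
The behaviour described above can be checked on the example sentence. The sketch below also compiles the unescaped character class to show the error, and compares against an alternative pattern (named `SPLIT_ON_SPACE` here, an assumption rather than part of the submitted solution) that keeps every run of non-punctuation characters between white spaces intact, since `\w+` silently drops symbols such as `%`:

```python
import re

PUNCT = r'[.,:;!?\-()"]'                   # punctuation markers from the task
WORD_ONLY = re.compile(r'\w+|' + PUNCT)    # same pattern as in basic_tokenize
# Alternative (assumption): a token is one punctuation marker, or any run of
# characters that are neither white space nor punctuation markers
SPLIT_ON_SPACE = re.compile(PUNCT + r'|[^\s.,:;!?\-()"]+')

sentence = "Pierre, who works at NR, also teaches at UiO."
tokens = WORD_ONLY.findall(sentence)
print(len(tokens), tokens)

# Without the backslash, "?-(" is parsed as a (reversed) character range
try:
    re.compile(r'[.,:;!?-()"]')
    range_error = None
except re.error as err:
    range_error = str(err)
print("Unescaped hyphen:", range_error)

# The two patterns differ on symbols outside \w and the punctuation list
print(WORD_ONLY.findall("50% av it-folk"))
print(SPLIT_ON_SPACE.findall("50% av it-folk"))
```

Both patterns give 12 tokens on the example sentence; they only diverge on text containing other symbols.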

We will now run the tokeniser on a small corpus, the [Norwegian Dependency Treebank](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-10/) (the corpus has been annotated with morphological features, syntactic functions and hierarchical structures, but we'll simply use here the raw text and discard all the annotation layers). We provide you with the data in the files `ndt_train_lm.txt` and `ndt_test_lm.txt`. 

__Task 1.2__ (1 point): Run the tokenizer you have implemented on `ndt_test_lm.txt`. How many tokens were extracted? And how many types (distinct words) were there? 

In [3]:
with open('./ndt_test_lm.txt', 'r', encoding='utf-8-sig') as f:
  text_test = f.read()
test_tokens = basic_tokenize(text_test)
#test_tokens

In [4]:
len(test_tokens)

258079

Shu_solution: The output of `basic_tokenize` is a list of tokens. Types (distinct words) differ from tokens in that each distinct word is counted only once, however often it occurs. We therefore only need to remove the duplicates from the list, which `set()` does directly.

In [5]:
test_types = set(test_tokens)
len(test_types)

30006
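
To make the token/type distinction concrete, here is a small sketch on a made-up sentence (the sentence and variable names are illustrative only). `collections.Counter` gives the same number of types as `set`, and additionally records how often each type occurs:

```python
from collections import Counter

toy_tokens = "en katt og en hund og en fugl".split()

type_counts = Counter(toy_tokens)   # maps each type to its number of occurrences
n_tokens = len(toy_tokens)          # every occurrence counts
n_types = len(type_counts)          # distinct words only, same as len(set(...))

print(n_tokens, n_types)
print(type_counts.most_common(2))
```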

We shall now use Byte-Pair Encoding (BPE) to limit the vocabulary of the tokenizer to 5,000.  An initial implementation of the algorithm is provided below.

In [6]:
from typing import Dict, List, Tuple, Iterator
import numpy as np
from tqdm.notebook import tqdm

class BPETokenizer:
    """Tokenizer based on the Byte-Pair Encoding algorithm. 
    Note: the current implementation is limited to Latin characters (ISO-8859-1)"""
    #def __init__(self, train_corpus_file: str, vocab_size = 500):
    def __init__(self, train_corpus_file: str, vocab_size = 5000): 
        """Creates a new BPE tokenizer, with merge pairs found using the given
        corpus file. The extraction of merge pairs stops when a vocabulary of 
        size vocab_size is reached."""

        # List of string pairs that should be merged when tokenizing
        # Example: ('e', 't'), which means that 'et' is a possible subword
        # Each string pair is mapped to an unique index number
        # (corresponding to their position in the self.vocab list)
        self.merge_pairs = {}

        # We add as basic vocab all characters of the extended ASCII
        self.vocab = [chr(i) for i in range(256)]

        ## Shu add###
        # identify punct
        self.punct = set('.,:;!?-()"')
        #########
       

        with open(train_corpus_file) as fd:

            # We first read the corpus, split on white space, and counts the
            # occurrences of each distinct word
            print("Counting word occurrences in corpus %s"%train_corpus_file, end="...", flush=True)
            text = fd.read()
            vocabulary_counts = {}
            for token in text.split():
                vocabulary_counts[token] = vocabulary_counts.get(token, 0) + 1
            print("Done")

            # We then iteratively extend the list of merge pairs until we
            # reach the desired size. Note: to speed up the algorithm, we 
            # extract n merge pairs at each iteration
            progress_bar = tqdm(total=vocab_size)
            while len(self.vocab) < vocab_size:
                most_common_pairs = self.get_most_common_pairs(vocabulary_counts)      
                for common_pair in most_common_pairs:
                    self.merge_pairs[common_pair] = len(self.vocab)
                    self.vocab.append("".join(common_pair))
                    if len(self.vocab) >= vocab_size:
                        break
                progress_bar.update(len(most_common_pairs))
                print("Examples of new subwords:", ["".join(pair) for pair in most_common_pairs][:10]) ##
            
    def get_most_common_pairs(self, vocabulary_counts: Dict[str,int], 
                              n:int=200) -> List[Tuple[str,str]]:
        """Given a set of distinct words along with their corresponding number 
        of occurrences in the corpus, returns the n most frequent pairs of subwords.       
        """

        # We count the frequencies of consecutive subwords in the vocabulary list
        pair_freqs = {}
        for word, word_count in vocabulary_counts.items():
            subwords = self.tokenize_word(word)
            
            for i in range(len(subwords)-1):
                byte_pair = (subwords[i], subwords[i+1])
                
                # Shu addition (Task 1.4): skip any pair in which either subword
                # contains a punctuation marker, so it never becomes a merge rule
                if any(char in self.punct for char in byte_pair[0] + byte_pair[1]):
                    continue
        
                pair_freqs[byte_pair] = pair_freqs.get(byte_pair, 0) + word_count

        # And return the most frequent ones
        most_freq_pairs = sorted(pair_freqs.keys(), key=lambda x: pair_freqs[x])[::-1][:n]
        return most_freq_pairs

    def __call__(self, input:str, show_progress_bar=True) -> Iterator[str]:
        """Tokenizes a full text"""

        # We first split into whitespace-separated tokens, and then in subwords
        words = input.split()
        for word in tqdm(words) if show_progress_bar else words:
            subwords = self.tokenize_word(word)
            for subword in subwords:
                yield subword
                

    def tokenize_word(self, word):
        """Splits the word into subwords, according to the merge pairs 
        currently stored in self.merge_pairs."""

        # We start with a list of characters
        # (+ a final character to denote the end of the word)    
        splits = list(word) + [" "]

        # We continue until there is nothing left to be merged
        while len(splits)>=2:

            # We extract consecutive subword pairs
            pairs = [(splits[i], splits[i+1]) for i in range(len(splits)-1)]

            # We find the "best" pair of subwords to merge -- that is, the one with the 
            # lowest position in the list of merge rules
            best_pair_to_merge = min(pairs, key=lambda x: self.merge_pairs.get(x, np.inf))
            if best_pair_to_merge in self.merge_pairs:

                # We then merge the two subwords
                for i in range(len(splits)-1):
                    if (splits[i], splits[i+1]) == best_pair_to_merge:
                        merged_subword = self.vocab[self.merge_pairs[best_pair_to_merge]]
                        splits = splits[:i] + [merged_subword] + splits[i+2:]
                        break
            else:
                break
        return splits
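
To see what one iteration of the merge-pair extraction does, the sketch below runs the same counting idea on a hand-made word-frequency table. It is a simplified stand-alone illustration (the functions `count_pairs` and `merge` and the toy counts are assumptions, not part of the class above): each word is a tuple of symbols ending with the end-of-word marker `" "`, and adjacent symbol pairs are weighted by the word frequency.

```python
from typing import Dict, List, Tuple

def count_pairs(segmented: Dict[Tuple[str, ...], int]) -> Dict[Tuple[str, str], int]:
    """Count adjacent symbol pairs, weighted by word frequency."""
    freqs: Dict[Tuple[str, str], int] = {}
    for symbols, count in segmented.items():
        for left, right in zip(symbols, symbols[1:]):
            freqs[(left, right)] = freqs.get((left, right), 0) + count
    return freqs

def merge(symbols: Tuple[str, ...], pair: Tuple[str, str]) -> Tuple[str, ...]:
    """Replace every occurrence of `pair` in `symbols` by the merged symbol."""
    out: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

# Toy corpus: word (as characters + end-of-word marker) -> frequency
words = {tuple("den") + (" ",): 5, tuple("det") + (" ",): 4, tuple("der") + (" ",): 3}

pair_freqs = count_pairs(words)
best_pair = max(pair_freqs, key=pair_freqs.get)   # most frequent adjacent pair
words = {merge(symbols, best_pair): count for symbols, count in words.items()}
print(best_pair, pair_freqs[best_pair])
print(list(words))
```

Here the pair `('d', 'e')` is the most frequent, so `'de'` becomes a new subword and all three toy words are re-segmented with it.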

__Task 1.3__ (1 point): Learn the BPE tokenizer on the `ndt_train_lm.txt` corpus, and then apply this tokenizer on `ndt_test_lm.txt`. Print the number of tokens and types (distinct subwords) obtained by this tokenizer on the test data. How do those numbers compare to the ones obtained with the basic tokenizer you had implemented earlier?

In [7]:
#Train the BPETokenizer based on training data.
tokenizer = BPETokenizer('ndt_train_lm.txt',vocab_size = 5000)

Counting word occurrences in corpus ndt_train_lm.txt...Done



Examples of new subwords: ['e ', 'r ', 'er', 't ', 'en', 'n ', 'de', 'et', 'te', 'g ']
Examples of new subwords: ['en ', 'om ', 'er ', 'og ', 'det ', 'som', 'for ', 'til ', 'på ', 'at ']
Examples of new subwords: ['som ', 'den ', 'ikkje ', 'men ', 'vil ', 'skal ', 'ikke ', 'politi', 'inga ', 'arbei']
Examples of new subwords: ['språk', 'meir ', 'mellom ', 'grunn', 'kunne ', 'sa ', 'kommun', 'sier ', 'norske ', 'iske ']
Examples of new subwords: ['partemen', 'økonom', 'histor', 'funn', 'sn', 'rund', 'mennesk', 'siste ', 'fått ', 'arbeid']
Examples of new subwords: ['departemen', 'samfunn', 'leiar ', 'rundt ', 'vurd', 'partementet ', 'folk', 'ering ', '6 ', 'gar']
Examples of new subwords: ['spesi', 'departementet ', 'mang', 'par ', 'alts', 'Fø', 'mme ', 'likevel ', 'minist', 'alen ']
Examples of new subwords: ['altså ', 'Språkrådet ', 'Arbei', 'millionar ', 'Førde ', 'plass ', 'ment', 'lik ', 'skre', 'gir ']
Examples of new subwords: ['Svalbar', 'alterna', 'enkel', 'stadig ', 'lenge ', 

In [8]:
#apply the trained tokenizer to the test data
test_new_tokens = []
for token in tokenizer(text_test):
    # The BPE tokenizer appends " " to mark the end of each word; drop the
    # stand-alone end-of-word markers that were not merged into a subword
    if token != " ":
        test_new_tokens.append(token)
#test_new_tokens


In [9]:
len(test_new_tokens)

398709

In [10]:
len(set(test_new_tokens))

4341

Shu_explanation:
The number of tokens is higher with the BPE tokenizer (398,709) than with the basic tokenizer (258,079). The reason is that the smallest unit of the basic tokenizer is the word, whereas the smallest unit of the BPE tokenizer is the subword. For example, the token 'Lam' is split into 'La' and 'm'. Every word that is split into several subwords therefore adds to the token count.

The number of types is greatly reduced with the BPE tokenizer (4,341 against 30,006). One reason is that, once words are split into subwords, the same subwords can be recombined in different orders to form different words, so the subword vocabulary is more compact and flexible. Another reason is that the BPE vocabulary is capped at 5,000 subwords, so rare words are not stored as types of their own but are broken down into more frequent subwords. The BPE tokenizer thus produces a much more concise and manageable vocabulary.
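
The sizes reported above can be put side by side. The short computation below uses only the four numbers printed in the earlier cells:

```python
basic_tokens, basic_types = 258079, 30006   # basic tokenizer on the test set
bpe_tokens, bpe_types = 398709, 4341        # BPE tokenizer on the test set

token_ratio = bpe_tokens / basic_tokens     # BPE tokens per basic token
type_ratio = bpe_types / basic_types        # relative size of the type inventory

print(f"BPE yields {token_ratio:.2f}x as many tokens")
print(f"BPE uses {type_ratio:.1%} as many types")
```

That is, roughly one and a half times as many tokens, drawn from about one seventh as many types.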

__Task 1.4__ (2 points): A limitation of the current BPE implementation is that it treats all characters in the same manner. A rather inconvenient side effect is that letters may be merged with punctuation markers (like 'ing', ',' --> 'ing,') if they are not separated by white space. Modify the implementation of the BPE algorithm above to prevent punctuation markers from being merged with letters. 

Shu_solution: 
In this task, we need to prevent subwords from being merged with punctuation. First, we define the set of punctuation characters as `self.punct` in `__init__`.
Then, in `get_most_common_pairs`, we check each candidate pair while counting pair frequencies: if either subword of the pair contains a punctuation character, we skip the pair with `continue`, so it is never counted and can never become a merge rule. Otherwise, the pair is counted as before.
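
The filter can be tried in isolation. In the sketch below, `PUNCT` mirrors `self.punct` and `involves_punct` is a hypothetical helper that applies the same test as the condition added to `get_most_common_pairs`:

```python
from typing import Tuple

PUNCT = set('.,:;!?-()"')

def involves_punct(pair: Tuple[str, str]) -> bool:
    """True if either subword of the pair contains a punctuation marker."""
    return any(char in PUNCT for subword in pair for char in subword)

candidates = [("in", "g"), ("ing", ","), (".", " "), ("e", " ")]
kept = [pair for pair in candidates if not involves_punct(pair)]
print(kept)
```

Note that the pair `('.', ' ')` is also filtered out, so with this condition a punctuation marker is never merged with the end-of-word marker either.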

__Task 1.5__ (_optional, 2 extra points_): In a [tweet](https://x.com/karpathy/status/1759996551378940395) published last year, the well-known AI researcher Andrej Karpathy stressed that many of the current limitations of Large Language Models are actually a product of the tokenisation step. Explain at least 4 of the problems he mentioned in his tweet (you can of course search online, or watch Karpathy's own video lecture on tokenization).

## Part 2: N-gram language models

We will now train simple N-gram language models on the NDT corpus, using the tokenizers we have developed in Part 1.

Here is the skeleton of the code:

In [11]:
import numpy as np
from abc import abstractmethod

class LanguageModel:
    """Generic class for running operations on language models, using a BPE tokenizer"""

    def __init__(self, tokenizer: BPETokenizer):
        """Build an abstract language model using the provided tokenizer"""
        self.tokenizer = tokenizer
 
    @abstractmethod
    def predict(self, context_tokens: List[str]):
        """Given a list of context tokens (=previous tokens), returns a dictionary
          mapping each possible token to its probability"""
        raise NotImplementedError()
    
    @abstractmethod
    def get_perplexity(self, text: str):
        """Computes the perplexity of the given text according to the LM"""

        print("Tokenising input text:")
        tokens = list(self.tokenizer(text))
        
        print("Computing perplexity:")
        log_probs = 0
        for i in tqdm(range(len(tokens))):
            context_tokens = ["<s>"] + tokens[:i]
            predict_distrib = self.predict(context_tokens)

            # We add the log-probabilities
            log_probs += np.log(predict_distrib[tokens[i]])
            
        perplexity = np.exp(-log_probs/len(tokens))
        return perplexity

class NGramLanguageModel(LanguageModel):
    """Representation of a N-gram-based language model"""

    def __init__(self, training_corpus_file: str, tokenizer:BPETokenizer, ngram_size:int=3,
                  alpha_smoothing:float=1):
        """Initialize the N-gram model with:
        - a file path to a training corpus to estimate the N-gram probabilities
        - an already learned BPE tokenizer
        - an N-gram size
        - a smoothing parameter (Laplace smoothing)"""
        
        LanguageModel.__init__(self, tokenizer)
        self.ngram_size = ngram_size
        
        # We define a simple backoff distribution (here just a uniform distribution)
        self.default_distrib = {token:1/len(tokenizer.vocab) for token in tokenizer.vocab}

        # Dictionary mapping a context (for instance the two preceding words if ngram_size=3)
        # to another dictionary specifying the probability of each possible word in the 
        # vocabulary. The context should be a tuple of tokens.
        self.ngram_probs = {}
        with open(training_corpus_file) as fd:   

            # based on the training corpus, tokenizer, ngram-size and smoothing parameter,
            # fill the self.ngram_probs with the correct N-gram probabilities  
            raise NotImplementedError()
 

    def predict(self, context_tokens: List[str]):
        """Given a list of preceding tokens, returns the probability distribution 
        over the next possible token."""

        # We restrict the contextual tokens to (N-1) tokens
        context_tokens = tuple(context_tokens[-self.ngram_size+1:])

        # If the contextual tokens were indeed observed in the corpus, simply
        # returns the precomputed probabilities
        if context_tokens in self.ngram_probs:
            return self.ngram_probs[context_tokens]
        
        # Otherwise, we return a uniform distribution over possible tokens
        else:
            return self.default_distrib
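
For reference, the quantity computed by `get_perplexity` can be written as follows, where $w_1, \dots, w_T$ are the tokens of the text, $T$ is `len(tokens)`, $\langle s \rangle$ is the start symbol, and the conditioning context is truncated to its last $N-1$ tokens by `predict`:

$$
\mathrm{PPL} = \exp\left(-\frac{1}{T} \sum_{i=1}^{T} \log P\left(w_i \mid \langle s \rangle, w_1, \dots, w_{i-1}\right)\right)
$$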

__Task 2.1__ (6 points): Complete the initialization method `__init__` to estimate the correct N-gram probabilities (with smoothing) based on the corpus. Don't worry about making your implementation super-efficient (although you can if you wish).

In [12]:
from collections import defaultdict, Counter

def __init__(self, training_corpus_file: str, tokenizer:BPETokenizer, ngram_size:int=2, alpha_smoothing=0.1):
        """Initialize the N-gram model with:
        - a file path to a training corpus to estimate the N-gram probabilities
        - an already learned BPE tokenizer
        - an N-gram size
        - a smoothing parameter (Laplace smoothing)"""

        

    
        LanguageModel.__init__(self, tokenizer)
        self.ngram_size = ngram_size
        
        # We define a simple backoff distribution (here just a uniform distribution)
        self.default_distrib = {token:1/len(tokenizer.vocab) for token in tokenizer.vocab}

        # Dictionary mapping a context (for instance the two preceding words if ngram_size=3)
        # to another dictionary specifying the probability of each possible word in the 
        # vocabulary. The context should be a tuple of tokens.
        self.ngram_probs = {}
    
        with open(training_corpus_file) as fd:  
            
            # ADD HERE YOUR CODE TO FILL THE VALUES IN self.ngram_probs

            #generate tokens
            text = fd.read()
            #tokens = list(tokenizer(text))
            
            tokens = ["<s>"] * (self.ngram_size - 1) + list(tokenizer(text))
            #print("Number of tokens in training data:", len(tokens))
            #print("First 20 tokens:", tokens[:20])

            
            # Count every n-gram (context + target token) and every context
            # (the first N-1 tokens of each n-gram). These two tables are all
            # we need to estimate P(target | context) under the Markov assumption.
            ngrams = [tuple(tokens[i:i+self.ngram_size]) for i in range(len(tokens)-self.ngram_size+1)]
            contexts = [item[:-1] for item in ngrams]

            ngram_counts = Counter(ngrams)
            context_counts = Counter(contexts)
            print(f"ngram_counts: {len(ngram_counts)}, context_counts: {len(context_counts)}")


            # Compute the smoothed probabilities from the counts. Add-alpha
            # (Laplace) smoothing gives a non-zero probability to tokens that
            # never followed a given context in the training data.
            # Note: only a fraction of the contexts (the first 20,000) is
            # processed here; all others fall back to the uniform distribution.
            for context_tokens in tqdm(list(context_counts)[:20000]):
                denominator = context_counts[context_tokens] + alpha_smoothing * len(self.tokenizer.vocab)
                distrib = {
                    token: (ngram_counts.get((*context_tokens, token), 0) + alpha_smoothing) / denominator
                    for token in self.tokenizer.vocab
                }
                self.ngram_probs[context_tokens] = distrib

            # Sanity check on one context (only meaningful for ngram_size=3, and
            # only if this context was among those processed above)
            ctx = ('<s>', 'No')
            if ctx in self.ngram_probs:
                sorted_probs = sorted(self.ngram_probs[ctx].items(), key=lambda x: x[1], reverse=True)
                print("Top 5 predictions:", sorted_probs[:5])

setattr(NGramLanguageModel, '__init__', __init__)

Shu_explanation:
In this task, we first tokenize `ndt_train_lm.txt` with the BPE tokenizer and prepend `<s>` start symbols to the resulting token list.
We then build two count tables:
1. `ngram_counts`: the number of occurrences of each full n-gram, i.e. a context followed by a target token. Each key is a tuple `(*context, target)`.
2. `context_counts`: the number of occurrences of each context (the first N-1 tokens of an n-gram).

Once we have all the counts, we can compute the conditional probability of each token given its context.
Here we apply Laplace (add-alpha) smoothing, so that a token that follows a context in the test data but never in the training data still receives a non-zero probability.
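
The estimation procedure can be illustrated on a toy example. The sketch below is a minimal stand-alone version (the function `add_alpha_probs`, the toy token list and the toy vocabulary are assumptions, not the submitted implementation) that builds the two count tables and applies add-alpha smoothing for a bigram model:

```python
from collections import Counter
from typing import Dict, List, Tuple

def add_alpha_probs(tokens: List[str], vocab: List[str], n: int = 2,
                    alpha: float = 0.1) -> Dict[Tuple[str, ...], Dict[str, float]]:
    """Estimate add-alpha smoothed n-gram probabilities from a token list."""
    padded = ["<s>"] * (n - 1) + tokens
    ngrams = [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]
    ngram_counts = Counter(ngrams)
    context_counts = Counter(ngram[:-1] for ngram in ngrams)

    probs = {}
    for context, context_count in context_counts.items():
        denominator = context_count + alpha * len(vocab)
        probs[context] = {
            token: (ngram_counts.get(context + (token,), 0) + alpha) / denominator
            for token in vocab
        }
    return probs

vocab = ["a", "b", "c"]
probs = add_alpha_probs(["a", "b", "a", "c", "a", "b"], vocab)

# Context ("a",) occurs 3 times: followed by "b" twice and by "c" once
print(probs[("a",)])
print(sum(probs[("a",)].values()))
```

Because the same `alpha` is added for every vocabulary item and `alpha * len(vocab)` is added to the denominator, each context's distribution still sums to 1.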

In [None]:
# test
from collections import defaultdict, Counter
a = [1,2,3,4,1,2,4,5,6]
b = [tuple(a[i:i+3]) for i in range(len(a)-3+1)]
c= [item[:-1] for item in b]
print(b,Counter(b),c,Counter(c))
#ngrams = [tuple(tokens[i:i+self.ngram_size]) for i in range(len(tokens)-self.ngram_size+1)]
#contexts = [ng[:-1] for ng in ngrams]

__Task 2.2__ (1 point): Train your language model on `ndt_train_lm.txt`, and compute its perplexity on the test data in `ndt_test_lm.txt`. The perplexity can be computed by calling the method `get_perplexity`. <br>
(_Note_: if the training takes too much time, feel free to stop the process after looking at a fraction of the corpus, at least while you are testing/developing your training setup).

In [13]:
lm = NGramLanguageModel("ndt_train_lm.txt",tokenizer, ngram_size=3, alpha_smoothing=0.1)


ngram_counts: 624814, context_counts: 262225



Top 5 predictions: [('kre ', 0.00219560878243513), ('\x00', 0.00019960079840319363), ('\x01', 0.00019960079840319363), ('\x02', 0.00019960079840319363), ('\x03', 0.00019960079840319363)]
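
These values can be verified by hand. Writing $c(\cdot)$ for a count in the training corpus, $h$ for the context, $\alpha$ for the smoothing parameter and $|V|$ for the vocabulary size, the smoothed estimate computed in `__init__` is:

$$
P(w \mid h) = \frac{c(h, w) + \alpha}{c(h) + \alpha\,|V|}
$$

Assuming the context `('<s>', 'No')` occurs exactly once among the counted n-grams and is followed by `'kre '` (which is what the printed probabilities imply), with $\alpha = 0.1$ and $|V| = 5000$:

$$
P(\text{'kre '} \mid h) = \frac{1 + 0.1}{1 + 0.1 \cdot 5000} = \frac{1.1}{501} \approx 0.0021956,
\qquad
P(w' \mid h) = \frac{0.1}{501} \approx 0.0001996 \ \text{ for every other token } w'.
$$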


In [14]:
with open('./ndt_test_lm.txt', 'r', encoding='utf-8-sig') as f:
  text_test = f.read(50000)
test_perplexity = lm.get_perplexity(text_test)

Tokenising input text:



Computing perplexity:



In [15]:
print(test_perplexity)

1554.11881177905
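
A useful reference point for interpreting this number: a model that always returns the uniform distribution over a vocabulary of size $V$ (which is what `predict` falls back to for unseen contexts) has a perplexity of exactly $V$, regardless of the text. The sketch below checks this with the same formula as `get_perplexity`, using only the standard library; the helper `perplexity` is hypothetical:

```python
import math
from typing import List

def perplexity(token_probs: List[float]) -> float:
    """Perplexity from the probabilities assigned to each observed token."""
    log_prob_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob_sum / len(token_probs))

vocab_size = 5000
uniform_ppl = perplexity([1 / vocab_size] * 1000)   # any text length gives the same value
print(uniform_ppl)
```

With a vocabulary of 5,000 subwords, the perplexity reported above is therefore below the uniform baseline of 5,000.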


In [None]:
Shu_reflection:
In this part, 

__Task 2.3__ (_optional_, 4 bonus points): Improve the language model you have just developed. You can choose to focus on improving your model through a backoff mechanism, interpolation, or both. Once you are done, compute the perplexity again on the test corpus to ensure the language model has indeed improved.
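
As an illustration of the interpolation option, here is a minimal sketch on toy distributions. The function `interpolate`, the weights and the two distributions are assumptions for illustration only; in an actual solution the component distributions would come from n-gram models of different orders, and the weights would typically be tuned on held-out data.

```python
from typing import Dict, List

def interpolate(distribs: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Linear interpolation: weighted sum of distributions over the same vocabulary."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    vocab = distribs[0].keys()
    return {token: sum(w * d[token] for w, d in zip(weights, distribs)) for token in vocab}

# Toy example: a sharp higher-order estimate mixed with a flatter lower-order one
trigram = {"a": 0.9, "b": 0.1, "c": 0.0}
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}

mixed = interpolate([trigram, unigram], [0.7, 0.3])
print(mixed)
```

The token `c`, which has zero probability under the higher-order toy distribution, receives a non-zero probability after mixing, and the result still sums to 1.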

## Part 3: Text classification

We will finally use the texts from the Norwegian Dependency Treebank for a classification task -- more precisely to determine whether a sentence is likely to be written in Bokmål or Nynorsk. To this end, we will use a simple bag-of-words setup (or more precisely a bag-of-_subwords_, since we will rely on the subwords extracted using BPE) along with a logistic regression model. As there are only two mutually exclusive classes, you can view the task as a binary classification problem. 

The training data is found in `ndt_train_class.txt` and simply consists of a list of sentences, each sentence being mapped to a language form (Bokmål: `nob` or Nynorsk: `nno`). The language form is written at the end of each line, separated by a `\t`. Note the training examples are currently _not_ shuffled.

To train and apply your classifier, the easiest is to use the [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model from `scikit-learn`.

__Task 3.1__ (2 points): Create an `N x V` matrix in which each row corresponds to a training example (out of `N` training instances) and each column corresponds to an individual feature, in this case the presence/absence of a particular subword in the sentence. In other words, there should be a total of `V` features, where `V` is the total size of the vocabulary for our BPE tokenizer. Also create a vector of length `N` with a value of `1` if the sentence was marked as Nynorsk, and `0` if it was marked as Bokmål. 
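
The layout of the data can be illustrated on a toy example. The sketch below uses a made-up four-subword vocabulary and pre-tokenized sentences, and plain Python lists rather than a NumPy array, purely to show the shape of `X` and `y`; all names and data are hypothetical.

```python
from typing import Dict, List, Tuple

vocab: List[str] = ["ikke ", "ikkje ", "en ", "ein "]
index: Dict[str, int] = {subword: j for j, subword in enumerate(vocab)}

# (subwords of the sentence, language label) pairs
examples: List[Tuple[List[str], str]] = [
    (["en ", "ikke "], "nob"),
    (["ein ", "ikkje ", "ein "], "nno"),
]

X = [[0] * len(vocab) for _ in examples]     # N x V matrix of presence indicators
y = [1 if label == "nno" else 0 for _, label in examples]
for i, (subwords, _) in enumerate(examples):
    for subword in subwords:
        X[i][index[subword]] = 1             # presence/absence, not a count

print(X)
print(y)
```

Each row has one entry per vocabulary item, and a repeated subword (here `'ein '`) still yields a single `1`.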

__Task 3.2__ (2 points): Use the data matrix you have just filled to train a logistic regression model (see the documentation on [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) for more details). We recommend using the `liblinear` solver. 

__Task 3.3__ (1 point): Now apply the learned logistic regression model to the test set in `ndt_test_class.txt`, and evaluate its performance in terms of accuracy, recall and precision (you can use the functionalities in `sklearn.metrics` to compute those easily).
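
For reference, the three metrics can be written out directly from the confusion counts. The sketch below uses made-up labels and treats Nynorsk (`1`) as the positive class; the function name is hypothetical, and `sklearn.metrics` provides equivalent functions.

```python
from typing import List, Tuple

def accuracy_precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float, float]:
    """Accuracy, precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0   # share of predicted 1s that are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # share of true 1s that are found
    return accuracy, precision, recall

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
accuracy, precision, recall = accuracy_precision_recall(y_true, y_pred)
print(accuracy, precision, recall)
```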

__Task 3.4__ (2 points): Inspect the weights learned by your logistic regression model (in `coef_`) and find the 5 subwords that contribute _the most_ to the classification of the sentence in Nynorsk. Also find the 5 subwords that contribute the most to the classification of the sentence in Bokmål. Do those weights make sense, according to what you know about Bokmål and Nynorsk?
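
The ranking step can be sketched independently of any trained model. The weights below are invented for illustration (they are not results), and `top_features` is a hypothetical helper. With Nynorsk encoded as `1`, the most positive weights push towards Nynorsk and the most negative ones towards Bokmål.

```python
from typing import List, Tuple

def top_features(vocab: List[str], weights: List[float], k: int) -> Tuple[List[str], List[str]]:
    """Return the k subwords with the largest and the k with the smallest weights."""
    ranked = sorted(zip(weights, vocab))          # ascending by weight
    most_negative = [subword for _, subword in ranked[:k]]
    most_positive = [subword for _, subword in ranked[::-1][:k]]
    return most_positive, most_negative

vocab = ["ikkje ", "ikke ", "ein ", "en ", "og "]
weights = [2.1, -1.9, 1.4, -1.2, 0.05]            # invented values

nynorsk_like, bokmal_like = top_features(vocab, weights, k=2)
print(nynorsk_like, bokmal_like)
```

For a binary `LogisticRegression`, `coef_` has shape `(1, V)`, so `coef_[0]` would play the role of `weights` here.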