# Bert-based morphological tagger <!-- Morfoloogiline ühestaja -->

## Table of contents
[Code](#kood)

   1. [**Collecting and tokenizing data**](#andmete_lugemine_kogumine_ja_margendamine)
      * [Verification](#kontrollimine_1)
   2. [**Statistics**](#statistika)
   3. [**Data splitting**](#andmete_jagamine_treening_test_ja_valideerimishulkadesse)
      * [Grouping data by text type](#andmete_grupeerimine_tekstiliigiti)
      * [Splitting data to sets](#andmete_jagamine_hulkadesse)
   4. [**Token classification model (for assigning morpological categories)**](#ner_mudel)
   5. [*Token classification model (for assigning morpological categories) by text type*](#ner_mudel_tekstiliigiti)
   6. [**Top n tags prediction**](#top_n_margendi_ennustamine)
   7. [**Model testing**](#mudeli_testimine)

[End](#end)

<!-- ## TODO
1) - [ ] Neurovõrkudel põhineva morfoloogilise ühestaja loomine ja integreerimine EstNLTK teegiga:
    1) - [ ] mudeli treenimiseks vajaliku töövoo loomine;
    2) - [ ] töövoo kasutamine mudeli treenimiseks;
    3) - [ ] mudeli integreerimine EstNLTK teeki.

2) - [ ] Morfoloogilise ühestaja treenimiseks vajalike andmestike moodustamine olemasolevatest andmestikest:
    1) - [ ] morfoloogilisemärgenduse loomine kasutades olemaolevaid EstNLTK automaatmärgendamise töövoogusid;
    2) - [ ] morfoloogilisemärgenduse loomine kasutades UD puudepanga andmeid;

3) - [ ] Saadud ühestaja tulemuste võrdlemine teiste kasutatavate rakendustega:
    1) - [ ] täpsuse ja saagise absoluutne hindamine testvalimil;
    2) - [ ] täpsuse ja saagise relatiivne hindamine märgendamata testandmetel;
    3) - [ ] uute informatiivsete treeningandmete välja valimine.

4) - [ ] Juhendmaterjalide  loomine uue ühestaja kasutamiseks -->

<a id='kood'></a>

## Code <!-- Kood -->

In [1]:
import os
import csv
import re
import gc
import pandas as pd
import sklearn as sk
import numpy as np
import estnltk
import torch
import logging
import simpletransformers
from simpletransformers.ner import NERModel, NERArgs

  from .autonotebook import tqdm as notebook_tqdm


INFO:config.py:58: PyTorch version 2.4.0 available.


<a id='andmete_lugemine_kogumine_ja_margendamine'></a>

### Collecting and tokenizing data <!-- Andmete lugemine kogumine ja margendamine -->

Data has been gathered from the [corpus](https://github.com/estnltk/eval_experiments_lrec_2020/blob/master/scripts_and_data/enc2017_selection_plain_texts_json.zip) that was used in evaluation experiments reported in LREC 2020 paper "EstNLTK 1.6: Remastered Estonian NLP Pipeline"[^1].

[^1]: Sven Laur, Siim Orasmaa, Dage Särg, Paul Tammo. "EstNLTK 1.6: Remastered Estonian NLP Pipeline". *Proceedings of The 12th Language Resources and Evaluation Conference*. European Language Resources Association: Marseille, France, May 2020, p. 7154-7162

Collecting `.json` files from directory <!--`json` failide kogumine kaustast-->

In [11]:
in_dir = '_plain_texts_json'

jsons = os.listdir(in_dir)

Reading text from files, tokenizing texts and save them file by file. <!--Failidest teksti lugemine, märgendamine ja salvestamine failhaaval -->

In [12]:
def create_csv_file_by_file():
    """
    Creates a CSV file for each text file. \n
    Skips CSV files that have already been created. \n
    For each <code>.json</code> file, the following info is gathered:
    <ul>
        <li><code>sentence_id</code> -- given for each sentence</li>
        <li><code>words</code> -- words gathered from text</li>
        <li><code>form</code> -- word form notation</li>
        <li><code>pos</code> -- part of speech</li>
        <li><code>type</code> -- text type (i.e. genre)</li>
        <li><code>source</code> -- file name where the text is taken from</li>
    </ul>
    <a href="https://github.com/Filosoft/vabamorf/blob/e6d42371006710175f7ec328c98f90b122930555/doc/tagset.md">Tables of morphological categories</a> for more information about <code>form</code> and <code>pos</code>.
    """
    print("Beginning tokenization file by file")
    for file_name in jsons:
        tokens = list()
        sentence_id = 0

        # Skipping previous CSV files
        csv_dir = '_plain_texts_csv'
        csv_file_name = file_name[:-4]+'csv'
        if os.path.exists(os.path.join(csv_dir, csv_file_name)):
            # print(f"Skipping {file_name} as {csv_file_name} already exists.")
            continue

        # print(f"Beginning to tokenize {file_name}")

        # Tokenization using estnltk
        text = estnltk.converters.json_to_text(file=os.path.join(in_dir, file_name))
        text_type = text.meta.get('texttype') # Text type
        if file_name.startswith('wiki17'):
            text_type = 'wikipedia'
        elif file_name.startswith('web13'):
            text_type = 'blogs_and_forums'
        morph_analysis = text.tag_layer('morph_analysis')
        for sentence in morph_analysis.sentences:
            sentence_analysis = sentence.morph_analysis
            for text, form, pos in zip(sentence_analysis.text, sentence_analysis.form, sentence_analysis.partofspeech):
                tokens.append((sentence_id, text, form[0], pos[0], text_type, file_name)) # In case of multiplicity, select the first or index 0
            sentence_id += 1

        # print(f"{file_name} tokenized, now saving")

        # Salvestamine
        with open(os.path.join(csv_dir, csv_file_name), 'w') as f:
            fieldnames = ['sentence_id', 'word', 'form', 'pos']
            writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            writer.writerow(fieldnames)
            for row in tokens:
                writer.writerow(row)

        # print(f"{file_name} saved to {csv_file_name}\n")

    print("Tokenization completed successfully")

Reading text from files, tokenizing texts and save tokenized texts into one whole dataset. <!-- Failidest teksti lugemine, märgendamine ja salvestamine kokku üheks andmestikuks -->

In [13]:
def create_df():
    """
    Creates a new dataset for the token classification model. \n
    For each <code>.json</code> file, the following info is gathered:
    <ul>
        <li><code>sentence_id</code> -- given for each sentence</li>
        <li><code>words</code> -- words gathered from text</li>
        <li><code>form</code> -- word form notation</li>
        <li><code>pos</code> -- part of speech</li>
        <li><code>type</code> -- text type (i.e. genre)</li>
        <li><code>source</code> -- file name where the text is taken from</li>
    </ul>
    <a href="https://github.com/Filosoft/vabamorf/blob/e6d42371006710175f7ec328c98f90b122930555/doc/tagset.md">Tables of morphological categories</a> for more information about <code>form</code> and <code>pos</code>.
    """
    tokens = list()
    sentence_id = 0
    fieldnames = ['sentence_id', 'words', 'form', 'pos', 'type', 'source']

    print("Beginning tokenization file by file. This can take a while.")
    for file_name in jsons:
        # print(f"Beginning to tokenize {file_name}")
        sentence_id = 0

        # Tokenization
        text = estnltk.converters.json_to_text(file=os.path.join(in_dir, file_name))
        text_type = text.meta.get('texttype') # Text type
        if file_name.startswith('wiki17'):
            text_type = 'wikipedia'
        elif file_name.startswith('web13'):
            text_type = 'blogs_and_forums'
        morph_analysis = text.tag_layer('morph_analysis')
        for sentence in morph_analysis.sentences:
            sentence_analysis = sentence.morph_analysis
            for text, form, pos in zip(sentence_analysis.text, sentence_analysis.form, sentence_analysis.partofspeech):
                if text:
                    tokens.append((sentence_id, text, form[0], pos[0], text_type, file_name)) # In case of multiplicity, select the first or index 0
            sentence_id += 1
        # print(f"{file_name} tokenized")

    print("Tokenization completed successfully")
    print("Creating Pandas dataframe")
    df = pd.DataFrame(data=tokens, columns=fieldnames)
    csv_file_name = 'andmestik.csv'
    df.to_csv(path_or_buf=csv_file_name, index=False)
    print(f"Tokenized texts saved to {csv_file_name}\n")

In [14]:
def clean_df(df, df_file_name):
    """Finishes dataframe by:
    <ul>
        <li>filling NaN values in columns <code>form</code> and <code>pos</code>with empty strings;</li>
        <li>removes NaN words.</li>
    </ul>

    Args:
        df (pandas.core.frame.DataFrame): Pandas dataframe to clean
        df_file_name (str): CSV file name from which dataframe was created
    """
    print("Assigning NaN values in columns form and pos with an empty string")
    # NaN values are assigned with an empty string
    df['form'] = df['form'].fillna('')
    df['pos'] = df['pos'].fillna('')
    print("Removing NaN words")
    # Removing NaN words
    df.dropna(subset=['words'], inplace=True)
    df.to_csv(path_or_buf=df_file_name, index=False)
    print(f"Modified dataframe saved to {df_file_name}")

In [None]:
create_df()

In [15]:
csv_file = "andmestik.csv"
df = pd.read_csv(csv_file)
clean_df(df, csv_file)

Assigning NaN values in columns form and pos with an empty string
Removing NaN words
Modified dataframe saved to andmestik.csv


<a id='kontrollimine_1'></a>

#### Verification <!-- Kontrollimine -->

Ensuring that correct info was collected from the json file<!-- Kontrollitakse, et json failist said õiged infokillud kogutud -->

In [49]:
# For JSON file
text = estnltk.converters.json_to_text(file='.\\_plain_texts_json\\nc_255_27981.json')
morph_analysis = text.tag_layer('morph_analysis')
ma_json = morph_analysis.sentences[0].morph_analysis
test_json = pd.DataFrame(data=list(zip(ma_json.text, ma_json.form, ma_json.partofspeech)), columns=("text", "form", "partofspeech"))
# display(test_json)

In [None]:
# For CSV file
lines = list()

with open('.\\_plain_texts_csv\\nc_255_27981.csv', 'r') as f:
    csvFile = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in csvFile:
        if line:
            lines.append(line)

test_csv = pd.DataFrame(lines[1:], columns=lines[0])

# display(test_csv)

In [50]:
# For Pandas dataframe
test_csv = df[df['source'] == 'nc_255_27981.json']

Example of the first sentence in file `nc_255_27981`<!-- Näide esimesest lausest failis `nc_255_27981` -->

In [55]:
display(test_json[-15:])
display(test_csv[test_csv.sentence_id == 0][-15:])

Unnamed: 0,text,form,partofspeech
15,samas,(sg in),(P)
16,kestab,(b),(V)
17,kohtuvaidlus,(sg n),(S)
18,veel,(),(D)
19,30,(?),(N)
20,FIEna,(sg es),(Y)
21,ametis,(sg in),(S)
22,olnud,"(nud, , sg n, pl n)","(V, A, A, A)"
23,töötaja,(sg g),(S)
24,palga,(sg g),(S)


Unnamed: 0,sentence_id,words,form,pos,type,source
4861762,0,samas,sg in,P,periodicals,nc_255_27981.json
4861763,0,kestab,b,V,periodicals,nc_255_27981.json
4861764,0,kohtuvaidlus,sg n,S,periodicals,nc_255_27981.json
4861765,0,veel,,D,periodicals,nc_255_27981.json
4861766,0,30,?,N,periodicals,nc_255_27981.json
4861767,0,FIEna,sg es,Y,periodicals,nc_255_27981.json
4861768,0,ametis,sg in,S,periodicals,nc_255_27981.json
4861769,0,olnud,nud,V,periodicals,nc_255_27981.json
4861770,0,töötaja,sg g,S,periodicals,nc_255_27981.json
4861771,0,palga,sg g,S,periodicals,nc_255_27981.json


<a id='statistika'></a>

### Statistics <!-- Statistika -->

In [16]:
# Reading CSV file
csv_file = "andmestik.csv"
df = pd.read_csv(csv_file)

Token count in each text type<!-- Sõnade arv igal tekstiliigil -->

In [17]:
print("Word count")
display(df['type'].value_counts())
print(f'Sum: {df['type'].count()}')

Word count


type
fiction             2170692
science             2121583
periodicals         2099374
wikipedia           2007179
blogs_and_forums    1991307
Name: count, dtype: int64

Sum: 10390135


Sentence count in each text type<!-- Lausete arv igal tekstiliigil -->

In [18]:
text_types = df['type'].unique().tolist()
print("Sentence count")
for text_type in text_types:
    sentence_grouped = df[df['type'] == text_type].groupby(['source', 'sentence_id'])
    sentence_count = sentence_grouped.ngroups
    print(f"{text_type}: {sentence_count}")
sentence_grouped = df.groupby(['source', 'sentence_id'])
sentence_count = sentence_grouped.ngroups
print(f"\nSum: {sentence_count}")

Sentence count
periodicals: 144979
fiction: 165249
science: 129810
blogs_and_forums: 159826


wikipedia: 136377

Sum: 736241


File count in each text type<!-- Failide arv tekstiliigiti -->

In [19]:
print("File count")
for text_type in text_types:
    print(f"{text_type}: {len(df['source'][df['type'] == text_type].value_counts().tolist())}")
print(f"\nSum: {len(df['source'].value_counts().tolist())}")

File count
periodicals: 5917
fiction: 53
science: 229
blogs_and_forums: 3016
wikipedia: 9270

Sum: 18485


<a id='andmete_jagamine_treening_test_ja_valideerimishulkadesse'></a>

### Data splitting<!-- ### Andmete jagamine treening-, test- ja valideerimishulkadesse -->

The size of the data used in training was based on the [statistics](https://github.com/UniversalDependencies/UD_Estonian-EDT/blob/master/stats.xml) found in [Estonian UD EDT treebank](https://github.com/UniversalDependencies/UD_Estonian-EDT). Data is gathered in such a way that all text types have more or less proportionate numbers of words.

|Text type|Word count|
|---|---|
|Fiction|             115847|
|Blogs and forums|    115719|
|Science|             115347|
|Periodicals|         115050|
|Wikipedia|           115014|
|**Sum**|             576977|

The tag that the model starts to predict consists of a concatenation of columns `form` and `pos` connected with underscore _ symbol (e.g. `form` s and `pos` V will create s_V). In the data those concatenations will be stored in the column `labels`.

<a id='andmete_grupeerimine_tekstiliigiti'></a>

#### Grouping data by text type <!-- #### Andmete grupeerimine tekstiliigiti -->

In [2]:
# Reading CSV file
csv_file = "andmestik.csv"
df = pd.read_csv(csv_file)

Creation of a new column `labels` concatenating the values of columns `pos` (part of speech) and `form` (word form notation)<!-- Uue veeru `labels` loomine, mis konkateneerib veergude `pos`(sõnaliik) ja `form`(sõnavormi tähistus) väärtused -->

In [3]:
# New 'labels' column
df['labels'] = df.apply(lambda row: str(row['form']) + '_' + str(row['pos']) if pd.notna(row['form']) and pd.notna(row['pos']) else str(row['form']) or str(row['pos']), axis=1)

In [4]:
# Example of dataset
df.head(5)

Unnamed: 0,sentence_id,words,form,pos,type,source,labels
0,0,Teisipäeval,sg ad,S,periodicals,nc_10114_622631.json,sg ad_S
1,0,saabus,s,V,periodicals,nc_10114_622631.json,s_V
2,0,Iisraelist,sg el,H,periodicals,nc_10114_622631.json,sg el_H
3,0,Tallinna,sg g,H,periodicals,nc_10114_622631.json,sg g_H
4,0,lennujaama,sg g,S,periodicals,nc_10114_622631.json,sg g_S


Obtaining unique tags from the column `label` that the model must predict<!-- Unikaalsete väärtuste saamine veerust `label`, mida mudel peab hakkama ennustama -->

In [5]:
unique_labels = df['labels'].unique().tolist()

A function that collects texts for each text type *more or less* in proportion to the number of words given as `n` <!-- Funktsioon, millega kogutakse iga tekstiliigi kohta tekste enam-vähem proportsionaalselt sõnade arvu suhtes -->

In [6]:
# Function to collect texts file by file
def gather_rows_for_type(group, n):
    """Gathers about `n` (>= n) rows for each text type\n
    Ensures that all text types have about the same number of words

    Args:
        group (): Pandas dataframe group
        n (int): number of words to gather

    Returns:
        _type_: Gathered rows for each text type
    """
    gathered_rows = pd.DataFrame()
    sources = group['source'].unique()

    for source in sources:
        source_rows = group[group['source'] == source]
        gathered_rows = pd.concat([gathered_rows, source_rows])
        if len(gathered_rows) >= n:
            break

    return gathered_rows

grouped = df.groupby('type')
data = pd.concat([gather_rows_for_type(group, 115000) for _, group in grouped]) #115000 words per text type for NER model
display(data)

Unnamed: 0,sentence_id,words,form,pos,type,source,labels
6391649,0,Kas,,D,blogs_and_forums,web13_206014_x.json,
6391650,0,liigselt,,D,blogs_and_forums,web13_206014_x.json,
6391651,0,lihastes,pl in,S,blogs_and_forums,web13_206014_x.json,pl in_S
6391652,0,bikiine,pl p,S,blogs_and_forums,web13_206014_x.json,pl p_S
6391653,0,kandev,sg n,A,blogs_and_forums,web13_206014_x.json,sg n_A
...,...,...,...,...,...,...,...
8497965,2,aasta,sg g,S,wikipedia,wiki17_112333_x.json,sg g_S
8497966,2,põhja,sg g,S,wikipedia,wiki17_112333_x.json,sg g_S
8497967,2,suusaalade,pl g,S,wikipedia,wiki17_112333_x.json,pl g_S
8497968,2,maailmameistrivõistlused,pl n,S,wikipedia,wiki17_112333_x.json,pl n_S


In [10]:
# Result
display(data['type'].value_counts())
print(f"Sum: {data['words'].size}")

type
fiction             115847
blogs_and_forums    115719
science             115347
periodicals         115050
wikipedia           115014
Name: count, dtype: int64

Sum: 576977


In [11]:
text_types = data['type'].unique().tolist()

In [12]:
def check_each_source_word_count(data):
    """Outputs word count for each source (i.e. text file) present in the data

    Args:
        data (pandas.core.frame.DataFrame): Pandas dataframe
    """
    for text_type in text_types:
        print(f"{text_type}, sources: {len(data['source'][data['type'] == text_type].value_counts().tolist())}")
        display(data['source'][data['type'] == text_type].value_counts().head(5))
        if len(data['source'][data['type'] == text_type].value_counts().tolist()) > 5:
            display(data['source'][data['type'] == text_type].value_counts().tail(5))

In [13]:
data.to_csv(path_or_buf='model_data.csv', index=False)

In [15]:
# unique_labels

<a id='andmete_jagamine_hulkadesse'></a>

#### Splitting data to sets<!-- #### Andmete jagamine hulkadesse -->

In [16]:
# Reading CSV file
data = pd.read_csv("model_data.csv")

Grouping data by filename to preserve the integrity of texts<!-- Andmete grupeerimine failinimede kaupa, et säilitada tekstide terviklikkus -->

In [10]:
grouped = data.groupby('source')

groups = list(grouped.groups.keys())
train_groups, test_groups = sk.model_selection.train_test_split(groups, test_size=0.2, random_state=42)

def filter_by_group(df, groups):
    return df[df['source'].isin(groups)]

# Splitting dataframe
train_df = filter_by_group(df, train_groups)
test_df = filter_by_group(df, test_groups)

Removing unnecessary columns from the model<!-- Mudelile ebavajalike veergude eemaldamine -->

In [None]:
train_df = train_df.drop(labels=['type', 'source'], axis=1)
test_df = test_df.drop(labels=['type', 'source'], axis=1)
# display(train_df)
# display(test_df)

```
#pos_labels = ['A', 'C', 'D', 'G', 'H', 'I', 'J', 'K', 'N', 'O', 'P', 'S', 'U', 'V', 'X', 'Y', 'Z']
#form_labels = ['ab', 'abl', 'ad', 'adt', 'all', 'el', 'es', 'g', 'ill', 'in', 'kom', 'n', 'p', 'pl', 'sg', 'ter', 'tr', 'b', 'd', 'da', 'des', 'ge', 'gem', 'gu', 'ks', 'ksid', 'ksime', 'ksin', 'ksite', 'ma', 'maks', 'mas', 'mast', 'mata', 'me', 'n', 'neg', 'nud', 'nuks', 'nuksid', 'nuksime', 'nuksin', 'nuksite', 'nuvat', 'o', 's', 'sid', 'sime', 'sin', 'site', 'ta', 'tagu', 'taks', 'takse', 'tama', 'tav', 'tavat', 'te', 'ti', 'tud', 'tuks', 'tuvat', 'v', 'vad', 'vat']
```

<a id='ner_mudel'></a>

### Token classification model (for assigning morpological categories)<!-- ### NER Mudel -->

Initializing the model<!-- Mudeli ülesehitamine -->

In [55]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

# Configurations
model_args = NERArgs()
model_args.train_batch_size = 8
model_args.evaluate_during_training = False
model_args.learning_rate = 5e-5
model_args.num_train_epochs = 10
model_args.use_early_stopping = True
model_args.use_cuda = torch.cuda.is_available()  # Use GPU if available
model_args.save_eval_checkpoints = False
model_args.save_model_every_epoch = False # Takes a lot of storage space
model_args.save_steps = -1
model_args.overwrite_output_dir = True
model_args.cache_dir = 'NER_mudel/cache'
model_args.best_model_dir = 'NER_mudel/best_model'
model_args.output_dir = 'NER_mudel'
model_args.use_multiprocessing = False

# Initialization
model = NERModel("camembert", "EMBEDDIA/est-roberta", args=model_args, labels=unique_labels)

Some weights of CamembertForTokenClassification were not initialized from the model checkpoint at EMBEDDIA/est-roberta and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Training the model<!-- Mudeli treenimine -->

In [57]:
if not os.path.exists(".\\NER_mudel\\"):
    # Train model
    print("Training model")
    model.train_model(train_df, eval_data=test_df)
else:
    model = NERModel("camembert", "NER_mudel", args=model_args, labels=unique_labels)

Evaluating the model<!-- Mudeli hindamine -->

In [20]:
# Evaluate the model
result, model_outputs, preds_list = model.eval_model(test_df)
print(result)

INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [01:01<00:00, 61.29s/it]
  with amp.autocast():
Running Evaluation: 100%|██████████| 4/4 [00:00<00:00,  7.83it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.24351004883646965, 'precision': 0.9554977345116863, 'recall': 0.9531033248310735, 'f1_score': 0.9542990277342175}
{'eval_loss': 0.24351004883646965, 'precision': 0.9554977345116863, 'recall': 0.9531033248310735, 'f1_score': 0.9542990277342175}


As you can see a lot of UserWarnings are logged in the output. This does not impose a problem because the model can still predict these tags.

Evaluating the model by text type<!-- Mudeli hindamine tekstiliigiti -->

In [None]:
for text_type in text_types:

    print(f"Text type:\t{text_type}")

    text_type_data = data.loc[data['type'] == text_type]

    print(f"Token count: {len(text_type_data)}")

    print(f"Starting evaluation")

    # Evaluate the model
    result, model_outputs, preds_list = model.eval_model(text_type_data)
    print(result)

Make predictions with the model <!-- Mudelil ennustamine -->

In [21]:
# Predict tags
predictions, raw_outputs = model.predict(["Kas toas on soe või palav, külm või jahe? Minu tunnetus ütleb, et ma pole tegelikult toaski."])

# Output of predictions
print(predictions)

INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:00<?, ?it/s]
  with amp.autocast():
Running Prediction: 100%|██████████| 1/1 [00:00<00:00, 76.94it/s]

[[{'Kas': 'D'}, {'toas': 'sg in_S'}, {'on': 'b_V'}, {'soe': 'sg n_A'}, {'või': 'J'}, {'palav,': 'sg n_A'}, {'külm': 'sg n_A'}, {'või': 'J'}, {'jahe?': 'sg n_A'}, {'Minu': 'sg g_P'}, {'tunnetus': 'sg n_S'}, {'ütleb,': 'b_V'}, {'et': 'J'}, {'ma': 'sg n_P'}, {'pole': 'neg o_V'}, {'tegelikult': 'D'}, {'toaski.': 'sg in_S'}]]





<a id='ner_mudel_tekstiliigiti'></a>

### Token classification model (for assigning morpological categories) NER model by text type<!-- ### NER mudel tekstiliigiti -->

In [25]:
for text_type in text_types:

    print(f"Text type:\t{text_type}")

    text_type_data = data.loc[data['type'] == text_type]

    print(f"Token count: {len(text_type_data)}")

    grouped = data.groupby('source')

    groups = list(grouped.groups.keys())
    train_groups, test_groups = sk.model_selection.train_test_split(groups, test_size=0.2, random_state=42)

    # Split the dataframe
    train_df = df[df['source'].isin(train_groups)]
    test_df = df[df['source'].isin(test_groups)]

    print(f"Train- and testset created.\Beginning to train")

    # Model

    logging.basicConfig(level=logging.INFO)
    transformers_logger = logging.getLogger("transformers")
    transformers_logger.setLevel(logging.WARNING)

    # Configure the model
    model_args = NERArgs()
    model_args.train_batch_size = 8
    model_args.evaluate_during_training = False
    model_args.learning_rate = 5e-5
    model_args.num_train_epochs = 10
    model_args.use_early_stopping = True
    model_args.use_cuda = torch.cuda.is_available()  # Use GPU if available
    model_args.overwrite_output_dir = True
    model_args.save_eval_checkpoints = False
    model_args.save_model_every_epoch = False
    model_args.save_steps = -1
    model_args.cache_dir = text_type + '_NER_mudel/cache'
    model_args.best_model_dir = text_type + '_NER_mudel/best_model'
    model_args.output_dir = text_type + '_NER_mudel'

    # Initialize the model
    model = NERModel("camembert", "EMBEDDIA/est-roberta", args=model_args, labels=unique_labels)

    # Train the model
    model.train_model(train_df, eval_data=test_df)

    print(f"Starting evaluation")

    # Evaluate the model
    result, model_outputs, preds_list = model.eval_model(text_type_data)
    print(result)

Tekstiliik:	blogs_and_forums
Suurus: 115719 sõna
Alustan hindamisega
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 2/2 [00:08<00:00,  4.49s/it]
  with amp.autocast():
Running Evaluation: 100%|██████████| 9/9 [00:01<00:00,  6.79it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.08652005354977316, 'precision': 0.9740840914679124, 'recall': 0.9738446411012782, 'f1_score': 0.9739643515673017}
{'eval_loss': 0.08652005354977316, 'precision': 0.9740840914679124, 'recall': 0.9738446411012782, 'f1_score': 0.9739643515673017}
Tekstiliik:	fiction
Suurus: 115847 sõna
Alustan hindamisega
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 15/15 [00:18<00:00,  1.24s/it]
Running Evaluation: 100%|██████████| 75/75 [00:12<00:00,  6.13it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.0043835937487892804, 'precision': 0.9991472669907052, 'recall': 0.9991046685141761, 'f1_score': 0.9991259672983863}
{'eval_loss': 0.0043835937487892804, 'precision': 0.9991472669907052, 'recall': 0.9991046685141761, 'f1_score': 0.9991259672983863}
Tekstiliik:	periodicals
Suurus: 115050 sõna
Alustan hindamisega
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:09<00:00,  9.00s/it]
Running Evaluation: 100%|██████████| 3/3 [00:00<00:00,  6.96it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.11051202068726222, 'precision': 0.9810868635002139, 'recall': 0.9799128130609454, 'f1_score': 0.9804994868286008}
{'eval_loss': 0.11051202068726222, 'precision': 0.9810868635002139, 'recall': 0.9799128130609454, 'f1_score': 0.9804994868286008}
Tekstiliik:	science
Suurus: 115347 sõna
Alustan hindamisega
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:08<00:00,  8.86s/it]
Running Evaluation: 100%|██████████| 4/4 [00:00<00:00,  6.16it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.1578108500689268, 'precision': 0.9701866365031914, 'recall': 0.971324570273003, 'f1_score': 0.9707552699124881}
{'eval_loss': 0.1578108500689268, 'precision': 0.9701866365031914, 'recall': 0.971324570273003, 'f1_score': 0.9707552699124881}
Tekstiliik:	wikipedia
Suurus: 115014 sõna
Alustan hindamisega
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:09<00:00,  9.05s/it]
Running Evaluation: 100%|██████████| 4/4 [00:00<00:00,  8.53it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.32173459976911545, 'precision': 0.9479583095374072, 'recall': 0.9465393114263312, 'f1_score': 0.9472482790598138}
{'eval_loss': 0.32173459976911545, 'precision': 0.9479583095374072, 'recall': 0.9465393114263312, 'f1_score': 0.9472482790598138}


New trained and evaluated NER models

| Text type         | Token count | eval_loss            | precision          | recall             | f1_score           |
|-------------------|-------------|----------------------|--------------------|--------------------|--------------------|
| blogs_and_forums  | 115719      | 0.23976334929466248  | 0.9538220130015691 | 0.9520293551707164 | 0.9529248409925647 |
| fiction           | 115847      | 0.2396537810564041   | 0.9538792523867151 | 0.952342596321654  | 0.9531103049845493 |
| periodicals       | 115050      | 0.24317898601293564  | 0.9550244383659926 | 0.9530585760952253 | 0.9540404945350295 |
| science           | 115347      | 0.2376849427819252   | 0.9557577659240665 | 0.9541325457555824 | 0.9549444643496955 |
| wikipedia         | 115014      | 0.23429438471794128  | 0.9547409930094999 | 0.953416565982011  | 0.9540783198620781 |

General NER model evaluations by text type

| Text type         | Token count | eval_loss            | precision          | recall             | f1_score           |
|-------------------|-------------|----------------------|--------------------|--------------------|--------------------|
| blogs_and_forums  | 115719      | 0.08652005354977316   | 0.9740840914679124 | 0.9738446411012782 | 0.9739643515673017 |
| fiction           | 115847      | 0.0043835937487892804 | 0.9991472669907052 | 0.9991046685141761 | 0.9991259672983863 |
| periodicals       | 115050      | 0.11051202068726222   | 0.9810868635002139 | 0.9799128130609454 | 0.9804994868286008 |
| science           | 115347      | 0.1578108500689268    | 0.9701866365031914 | 0.971324570273003  | 0.9707552699124881 |
| wikipedia         | 115014      | 0.32173459976911545   | 0.9479583095374072 | 0.9465393114263312 | 0.9472482790598138 |

<a id='top_n_margendi_ennustamine'></a>

### Top n tags prediction<!-- ### Top n märgendi ennustamine -->

In [82]:
# Determine the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Ensure model is on the right device
model.model.to(device)

# Sample input sentence
sentence = "Uuringutulemuste interpretatsioon jätab mitmed küsimused vastuseta."

# Tokenize and predict (This is normally done internally by simpletransformers)
inputs = model.tokenizer(sentence, return_tensors="pt")
inputs = {key: value.to(model.device) for key, value in inputs.items()}

# Get the logits from the model
model_output = model.model(**inputs)
logits = model_output.logits  # Ensure logits are on the same device

# Apply softmax to logits to get probabilities
probs = torch.nn.functional.softmax(logits, dim=-1)

# Get top n tags for each token
top_n = 3
top_n_probs, top_n_indices = torch.topk(probs, top_n, dim=-1)

# Decode the tags
for token_idx in range(len(inputs["input_ids"][0])):
    token = model.tokenizer.decode([inputs["input_ids"][0][token_idx]])
    print(f"Token: {token}")
    for i in range(top_n):
        label_idx = top_n_indices[0][token_idx][i].item()
        label_prob = top_n_probs[0][token_idx][i].item()
        label = model.args.labels_list[label_idx]
        print(f"  Tag {i+1}: {label} (Probability: {label_prob:.4f})")

Tooken: <s>
   Märk 1: Z (Tõenäosus: 0.2809)
   Märk 2: sg n_S (Tõenäosus: 0.1308)
Tooken: Uuringu
   Märk 1: pl g_S (Tõenäosus: 0.9995)
   Märk 2: pl g_H (Tõenäosus: 0.0001)
Tooken: tule
   Märk 1: pl g_S (Tõenäosus: 0.9996)
   Märk 2: pl g_H (Tõenäosus: 0.0000)
Tooken: muste
   Märk 1: pl g_S (Tõenäosus: 0.9996)
   Märk 2: pl g_H (Tõenäosus: 0.0000)
Tooken: inter
   Märk 1: sg n_S (Tõenäosus: 0.9998)
   Märk 2: sg n_H (Tõenäosus: 0.0000)
Tooken: pre
   Märk 1: sg n_S (Tõenäosus: 0.9997)
   Märk 2: sg n_H (Tõenäosus: 0.0000)
Tooken: tatsioon
   Märk 1: sg n_S (Tõenäosus: 0.9996)
   Märk 2: sg n_H (Tõenäosus: 0.0000)
Tooken: jätab
   Märk 1: b_V (Tõenäosus: 0.9999)
   Märk 2: s_V (Tõenäosus: 0.0000)
Tooken: mitmed
   Märk 1: pl n_P (Tõenäosus: 0.9998)
   Märk 2: pl n_A (Tõenäosus: 0.0000)
Tooken: küsimused
   Märk 1: pl n_S (Tõenäosus: 0.9996)
   Märk 2: pl n_A (Tõenäosus: 0.0001)
Tooken: vastuse
   Märk 1: sg ab_S (Tõenäosus: 0.9892)
   Märk 2: mata_V (Tõenäosus: 0.0037)
Tooken: ta
  

In [83]:
# Make predictions with the model
predictions, raw_outputs = model.predict(["Uuringutulemuste interpretatsioon jätab mitmed küsimused vastuseta."])

# Output the predictions
print(predictions)

INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:00<?, ?it/s]
  with amp.autocast():
Running Prediction: 100%|██████████| 1/1 [00:00<00:00, 71.42it/s]

[[{'Uuringutulemuste': 'pl g_S'}, {'interpretatsioon': 'sg n_S'}, {'jätab': 'b_V'}, {'mitmed': 'pl n_P'}, {'küsimused': 'pl n_S'}, {'vastuseta.': 'sg ab_S'}]]





<a id='mudeli_testimine'></a>

### Model testing

### END

<a id='end'></a>