# BERT-based morphological tagger <!-- Morfoloogiline ühestaja -->

## Table of contents
[Code](#kood)

   1. [**enc2017 corpus**](#enc_korpus)
      1. [**Defined functions**](#defineeritud_funktsioonid)
      2. [**Collecting and tokenizing data**](#andmete_lugemine_kogumine_ja_margendamine)
         * [Verification](#kontrollimine_1)
      3. [**Statistics**](#statistika)
      4. [**Data splitting**](#andmete_jagamine_treening_test_ja_valideerimishulkadesse)
         * [Grouping data by text type](#andmete_grupeerimine_tekstiliigiti)
         * [Splitting data to sets](#andmete_jagamine_hulkadesse)
      5. [**Token classification model (for assigning morphological categories)**](#ner_mudel)
      6. [*Token classification model (for assigning morphological categories) by text type*](#ner_mudel_tekstiliigiti)
      7. [**Top n tags prediction**](#top_n_margendi_ennustamine)
      8. [**Gathering unused data and comparing Vabamorf / NER model**](#kasutamata_andmete_kogumine)

[End](#end)

<!-- ## TODO
1) - [ ] Neurovõrkudel põhineva morfoloogilise ühestaja loomine ja integreerimine EstNLTK teegiga:
    1) - [ ] mudeli treenimiseks vajaliku töövoo loomine;
    2) - [ ] töövoo kasutamine mudeli treenimiseks;
    3) - [ ] mudeli integreerimine EstNLTK teeki.

2) - [ ] Morfoloogilise ühestaja treenimiseks vajalike andmestike moodustamine olemasolevatest andmestikest:
    1) - [ ] morfoloogilisemärgenduse loomine kasutades olemaolevaid EstNLTK automaatmärgendamise töövoogusid;
    2) - [ ] morfoloogilisemärgenduse loomine kasutades UD puudepanga andmeid;

3) - [ ] Saadud ühestaja tulemuste võrdlemine teiste kasutatavate rakendustega:
    1) - [ ] täpsuse ja saagise absoluutne hindamine testvalimil;
    2) - [ ] täpsuse ja saagise relatiivne hindamine märgendamata testandmetel;
    3) - [ ] uute informatiivsete treeningandmete välja valimine.

4) - [ ] Juhendmaterjalide  loomine uue ühestaja kasutamiseks
5) - [ ] Zip the model somehow, https://github.com/estnltk/estnltk/blob/main/estnltk_neural/estnltk_neural/taggers/ner/estbertner_tagger.py -->

## Code <!-- Kood -->

<a id='kood'></a>

In [1]:
import os
import csv
import re
import gc
import itertools
import json
import logging
import warnings
import pkg_resources
import types
import pandas as pd
import sklearn as sk
import numpy as np
import estnltk
import torch
import math
import simpletransformers
from simpletransformers.ner import NERModel, NERArgs
from est_ud_utils import load_ud_file_texts_with_corrections, load_ud_file_with_corrections
from est_ud_morph_conv import convert_ud_layer_to_reduced_morph_layer
from estnltk_neural.taggers.embeddings.bert.bert_tokens_to_words_rewriter import BertTokens2WordsRewriter



INFO:config.py:58: PyTorch version 2.4.0 available.


In [2]:
# Get locally imported modules from current notebook - https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook - Alex P. Miller
def get_imports():
    
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
            
        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
            
        yield name
imports = list(set(get_imports()))

# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names 
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

for r in requirements:
    print("{}=={}".format(*r))

estnltk==1.7.3
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.5.1
simpletransformers==0.70.1
torch==2.4.0
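
`pkg_resources` is deprecated in recent setuptools releases. As a hedged alternative, the same kind of version listing could be sketched with the standard-library `importlib.metadata`. The helper name `installed_version` and the hard-coded package list are illustrative assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None

# Distribution (pip) names, not import names: e.g. "scikit-learn" rather than "sklearn"
for dist_name in ["estnltk", "numpy", "pandas", "scikit-learn", "simpletransformers", "torch"]:
    found = installed_version(dist_name)
    if found is not None:
        print(f"{dist_name}=={found}")
```

Packages that are not installed are skipped silently, so the printed list depends on the environment.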


### enc2017 corpus

The data comes from the [corpus](https://github.com/estnltk/eval_experiments_lrec_2020/blob/master/scripts_and_data/enc2017_selection_plain_texts_json.zip) that was used in the evaluation experiments reported in the LREC 2020 paper "EstNLTK 1.6: Remastered Estonian NLP Pipeline"[^1].

[^1]: Sven Laur, Siim Orasmaa, Dage Särg, Paul Tammo. "EstNLTK 1.6: Remastered Estonian NLP Pipeline". *Proceedings of The 12th Language Resources and Evaluation Conference*. European Language Resources Association: Marseille, France, May 2020, p. 7154-7162

<a id='enc_korpus'></a>

#### Defined functions <!-- Defineeritud funktsioonid -->

<a id='defineeritud_funktsioonid'></a>

In [62]:
def create_csv_file_by_file_enc2017(jsons, in_dir):
    """
    Creates a CSV file for each text file. \n
    Skips CSV files that have already been created. \n
    For each <code>.json</code> file, the following info is gathered:
    <ul>
        <li><code>sentence_id</code> -- given for each sentence</li>
        <li><code>words</code> -- words gathered from text</li>
        <li><code>form</code> -- word form notation</li>
        <li><code>pos</code> -- part of speech</li>
        <li><code>type</code> -- text type (i.e. genre)</li>
        <li><code>source</code> -- file name where the text is taken from</li>
    </ul>
    <a href="https://github.com/Filosoft/vabamorf/blob/e6d42371006710175f7ec328c98f90b122930555/doc/tagset.md">Tables of morphological categories</a> for more information about <code>form</code> and <code>pos</code>.

    Args:
        jsons (list): List of json files from which to read in the text
        in_dir (string): Directory containing list of files (<code>jsons</code>)
    """
    print("Beginning tokenization file by file")
    for file_name in jsons:
        tokens = list()
        sentence_id = 0

        # Skipping previous CSV files
        csv_dir = '_plain_texts_csv'
        csv_file_name = file_name[:-4]+'csv'
        if os.path.exists(os.path.join(csv_dir, csv_file_name)):
            # print(f"Skipping {file_name} as {csv_file_name} already exists.")
            continue

        # print(f"Beginning to tokenize {file_name}")

        # Tokenization using estnltk
        text = estnltk.converters.json_to_text(file=os.path.join(in_dir, file_name))
        text_type = text.meta.get('texttype') # Text type
        if file_name.startswith('wiki17'):
            text_type = 'wikipedia'
        elif file_name.startswith('web13'):
            text_type = 'blogs_and_forums'
        morph_analysis = text.tag_layer('morph_analysis')
        for sentence in morph_analysis.sentences:
            sentence_analysis = sentence.morph_analysis
            for text, form, pos in zip(sentence_analysis.text, sentence_analysis.form, sentence_analysis.partofspeech):
                if text:
                    tokens.append((sentence_id, text, form[0], pos[0], text_type, file_name)) # In case of multiplicity, select the first or index 0
            sentence_id += 1

        # print(f"{file_name} tokenized, now saving")

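        # Saving
        os.makedirs(csv_dir, exist_ok=True)
        with open(os.path.join(csv_dir, csv_file_name), 'w', newline='', encoding='utf-8') as f:
            # Header must match the six-element tuples collected in `tokens`
            fieldnames = ['sentence_id', 'words', 'form', 'pos', 'type', 'source']
            writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
            writer.writerow(fieldnames)
            for row in tokens:
                writer.writerow(row)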

        # print(f"{file_name} saved to {csv_file_name}\n")

    print("Tokenization completed successfully")

Reads text from the files, tokenizes it and saves the tokenized texts as one dataset. <!-- Failidest teksti lugemine, märgendamine ja salvestamine kokku üheks andmestikuks -->

In [45]:
def create_df_enc2017(jsons, in_dir, csv_filename):
    """
    Creates a single dataset from the <code>.json</code> files of the enc2017 corpus selection. \n
    For each <code>.json</code> file, the following info is gathered:
    <ul>
        <li><code>sentence_id</code> -- given for each sentence</li>
        <li><code>words</code> -- words gathered from text</li>
        <li><code>form</code> -- word form notation</li>
        <li><code>pos</code> -- part of speech</li>
        <li><code>type</code> -- text type (i.e. genre)</li>
        <li><code>source</code> -- file name where the text is taken from</li>
    </ul>
    <a href="https://github.com/Filosoft/vabamorf/blob/e6d42371006710175f7ec328c98f90b122930555/doc/tagset.md">Tables of morphological categories</a> for more information about <code>form</code> and <code>pos</code>.

    Args:
        jsons (list): List of json files from which to read in the text
        in_dir (string): Directory containing list of files (<code>jsons</code>)
        csv_filename (string): CSV filename where to save the gathered text
    """
    tokens = list()
    sentence_id = 0
    fieldnames = ['sentence_id', 'words', 'form', 'pos', 'type', 'source']

    print("Beginning tokenization file by file. This can take a while.")
    for file_name in jsons:
        # print(f"Beginning to tokenize {file_name}")
        sentence_id = 0

        # Tokenization
        text = estnltk.converters.json_to_text(file=os.path.join(in_dir, file_name))
        text_type = text.meta.get('texttype') # Text type
        if file_name.startswith('wiki17'):
            text_type = 'wikipedia'
        elif file_name.startswith('web13'):
            text_type = 'blogs_and_forums'
        morph_analysis = text.tag_layer('morph_analysis')
        for sentence in morph_analysis.sentences:
            sentence_analysis = sentence.morph_analysis
            for text, form, pos in zip(sentence_analysis.text, sentence_analysis.form, sentence_analysis.partofspeech):
                if text:
                    tokens.append((sentence_id, text, form[0], pos[0], text_type, file_name)) # In case of multiplicity, select the first or index 0
            sentence_id += 1
        # print(f"{file_name} tokenized")

    print("Tokenization completed successfully")
    print("Creating Pandas dataframe")
    df = pd.DataFrame(data=tokens, columns=fieldnames)
    df.to_csv(path_or_buf=csv_filename, index=False)
    print(f"Tokenized texts saved to {csv_filename}\n")

Cleaning the dataset by:

* filling NaN values in columns `form` and `pos` with empty strings;
* removing NaN words.

In [46]:
def clean_df(df, df_file_name=None):
    """Finishes dataframe by:
    <ul>
        <li>filling NaN values in columns <code>form</code> and <code>pos</code> with empty strings;</li>
        <li>removing NaN words.</li>
    </ul>

    Args:
        df (pandas.core.frame.DataFrame): Pandas dataframe to clean
        df_file_name (str): CSV file name from which dataframe was created
    """
    print("Assigning NaN values in columns form and pos with an empty string")
    # NaN values are assigned with an empty string
    df['form'] = df['form'].fillna('')
    df['pos'] = df['pos'].fillna('')
    print("Removing NaN words")
    # Removing NaN words
    df.dropna(subset=['words'], inplace=True)
    if df_file_name:
        df.to_csv(path_or_buf=df_file_name, index=False)
        print(f"Modified dataframe saved to {df_file_name}")
    else:
        print("Dataframe cleaned")

Creation of a new column `labels` concatenating the values of columns `pos` (part of speech) and `form` (word form notation)<!-- Uue veeru `labels` loomine, mis konkateneerib veergude `pos`(sõnaliik) ja `form`(sõnavormi tähistus) väärtused -->

In [47]:
# New 'labels' column
def create_labels_column(df, df_file_name=None):
    """
    Creates a new column <code>labels</code> concatenating the values of columns <code>pos</code> (part of speech) and <code>form</code> (word form notation)

    Args:
        df (pandas.core.frame.DataFrame): Pandas dataframe to create a new column
        df_file_name (str): CSV file name from which dataframe was created
    """
    print("Creating column 'labels'")
    df['labels'] = df.apply(lambda row: str(row['form']) + '_' + str(row['pos']) if row['form'] and row['pos'] else str(row['form']) or str(row['pos']), axis=1)
    print("Column 'labels' created")
    if df_file_name:
        df.to_csv(path_or_buf=df_file_name, index=False)
        print(f"Modified dataframe saved to {df_file_name}")
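
The concatenation rule used for `labels` can be illustrated without pandas. This is a minimal sketch of the same logic as the lambda in `create_labels_column`; the helper name `make_label` and the sample pairs are illustrative assumptions.

```python
def make_label(form, pos):
    """Join form and POS with '_'; if one of them is empty, use the other one alone."""
    if form and pos:
        return f"{form}_{pos}"
    return form or pos

# (form, pos) pairs in the style of the Vabamorf output used in this notebook
pairs = [("sg n", "S"), ("", "D"), ("b", "V"), ("?", "N")]
labels = [make_label(form, pos) for form, pos in pairs]
print(labels)  # ['sg n_S', 'D', 'b_V', '?_N']
```

Words without a form (such as adverbs) end up with a label that is just the part of speech.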

Obtaining unique tags that the model must predict<!-- Unikaalsete väärtuste saamine, mida mudel peab hakkama ennustama -->

In [50]:
def get_unique_labels():
    """Creates list of unique labels that the model must predict by creating all possible combinations of POS (Part Of Speech) and form.

    <i>Gathering unique labels from the enc2017 database proved to be insufficient for future model evaluation,
    because the database does not contain all possible combinations of POS and form.
    Evaluating model with UD Est-EDT test corpus proved that this problem existed.</i>

    Returns:
        list: List of unique labels
    """
    # Full POS and form lists, also used on their own for labels where only one of the two is present
    pos_labels = ['A', 'C', 'D', 'G', 'H', 'I', 'J', 'K', 'N', 'O', 'P', 'S', 'U', 'V', 'X', 'Y', 'Z']
    form_labels = ['ab', 'abl', 'ad', 'adt', 'all', 'el', 'es', 'g', 'ill', 'in', 'kom', 'n', 'p', 'pl', 'sg', 'ter', 'tr', 'b', 'd', 'da', 'des', 'ge', 'gem', 'gu', 'ks', 'ksid', 'ksime', 'ksin', 'ksite', 'ma', 'maks', 'mas', 'mast', 'mata', 'me', 'n', 'neg', 'nud', 'nuks', 'nuksid', 'nuksime', 'nuksin', 'nuksite', 'nuvat', 'o', 's', 'sid', 'sime', 'sin', 'site', 'ta', 'tagu', 'taks', 'takse', 'tama', 'tav', 'tavat', 'te', 'ti', 'tud', 'tuks', 'tuvat', 'v', 'vad', 'vat']

    pos_labels_mutable = ['A', 'C', 'N', 'H', 'O', 'P', 'S', 'U', 'Y', 'X']
    #pos_labels_immutable = ['D', 'G', 'I', 'J', 'K', 'X', 'Y', 'Z']
    pos_label_verb = ['V']
    form_labels_noun = ['ab', 'abl', 'ad', 'adt', 'all', 'el', 'es', 'g', 'ill', 'in', 'kom', 'n', 'p', 'ter', 'tr']
    form_labels_noun_count = ['pl', 'sg']
    form_labels_verb = ['b', 'd', 'da', 'des', 'ge', 'gem', 'gu', 'ks', 'ksid', 'ksime', 'ksin', 'ksite', 'ma', 'maks', 'mas', 'mast', 'mata', 'me', 'n', 'neg', 'neg ge', 'neg gem', 'neg gu', 'neg ks', 'neg me', 'neg nud', 'neg nuks', 'neg o', 'neg vat', 'neg tud', 'nud', 'nuks', 'nuksid', 'nuksime', 'nuksin', 'nuksite', 'nuvat', 'o', 's', 'sid', 'sime', 'sin', 'site', 'ta', 'tagu', 'taks', 'takse', 'tama', 'tav', 'tavat', 'te', 'ti', 'tud', 'tuks', 'tuvat', 'v', 'vad', 'vat']

    noun_labels_without_pos = list(itertools.product(form_labels_noun_count, form_labels_noun))
    noun_labels_nested = list(itertools.product(noun_labels_without_pos, pos_labels_mutable))
    form_pos_labels = list(itertools.product(form_labels, pos_labels))
    noun_labels = list()
    verb_labels = list(itertools.product(form_labels_verb, pos_label_verb))

    # Connect count and form in mutables
    for form, pos in noun_labels_nested:
        noun_labels.append((form[0] + ' ' + form[1], pos))

    pos_label_only = [('', pos) for pos in pos_labels]
    form_label_only = [(form, '') for form in form_labels]
    unknown_form_labels = [('?', pos) for pos in pos_labels] # form '?' comes from enc2017 corpus after tokenization

    unique_labels = pos_label_only + form_label_only + noun_labels + verb_labels + unknown_form_labels + form_pos_labels + [('?', '')] # '?' for labels unknown to Vabamorf

    unique_labels_df = pd.DataFrame(unique_labels, columns=['form', 'pos'])
    create_labels_column(unique_labels_df)
    # The lists above overlap (e.g. 'n' occurs twice in form_labels), so drop duplicates while keeping the order
    labels = list(dict.fromkeys(unique_labels_df['labels'].tolist()))
    print("List of unique labels created")
    return labels
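
The way `get_unique_labels` builds the label space for declinable words (number × case × part of speech) can be sketched on a small scale. The three short lists below are illustrative subsets chosen for this example, not the full lists from the function.

```python
import itertools

# Small illustrative subsets of the full lists used in get_unique_labels
numbers = ["sg", "pl"]
cases = ["n", "g", "p"]
pos_tags = ["S", "A"]

# First pair number with case, then pair each (number, case) with a part of speech
number_case = itertools.product(numbers, cases)
labels = [
    f"{number} {case}_{pos}"
    for (number, case), pos in itertools.product(number_case, pos_tags)
]
print(len(labels))  # 12
print(labels[:4])   # ['sg n_S', 'sg n_A', 'sg g_S', 'sg g_A']
```

Enumerating every combination yields labels that may never occur in the corpus, which is the point: the model can then be evaluated on data containing combinations absent from enc2017.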

A function that collects whole texts for each text type until *roughly* `n` words have been gathered, so that all text types are represented by about the same number of words <!-- Funktsioon, millega kogutakse iga tekstiliigi kohta tekste enam-vähem proportsionaalselt sõnade arvu suhtes -->

In [51]:
# Function to collect texts file by file
def gather_rows_for_text_type(df, n, random_state=None):
    """Gathers about `n` (>= n) rows for each text type\n
    Ensures that all text types have about the same number of words.

    Args:
        df (pd.DataFrame): The DataFrame containing the text data.
        n (int): Number of words to gather.
        random_state (int, optional): Seed for the shuffle function (acts like <code>random_state</code>).

    Returns:
        pandas.core.frame.DataFrame: Gathered rows for each text type.
    """

    def gather_rows_for_type(group, n, random_state=None):
        """Gathers about `n` (>= n) rows for the text type\n

        Args:
            group (): Pandas dataframe group
            n (int): Number of words to gather
            random_state (int): Seed for the shuffle function (acts like <code>random_state</code>)

        Returns:
            pandas.core.frame.DataFrame: Gathered rows for text type
        """
        gathered_rows = pd.DataFrame()
        sources = group['source'].unique()

        if random_state is not None:  # a seed of 0 is valid too
            np.random.seed(random_state)
            np.random.shuffle(sources)

        for source in sources:
            source_rows = group[group['source'] == source]
            gathered_rows = pd.concat([gathered_rows, source_rows])
            if len(gathered_rows) >= n:
                break

        return gathered_rows

    grouped = df.groupby('type')
    data = pd.concat([gather_rows_for_type(group, n, random_state) for _, group in grouped])
    return data
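
The gathering strategy for a single text type can be sketched in plain Python: whole source files are added one at a time until at least `n` words are collected. The function name `pick_sources` and the word counts are hypothetical, and the standard-library `random` module stands in for the NumPy shuffle used above.

```python
import random

def pick_sources(source_sizes, n, seed=None):
    """Add whole sources one by one until at least n words are collected."""
    sources = list(source_sizes)
    if seed is not None:
        random.Random(seed).shuffle(sources)  # reproducible order for a given seed
    picked, total = [], 0
    for source in sources:
        picked.append(source)
        total += source_sizes[source]
        if total >= n:
            break
    return picked, total

# Hypothetical word counts per source file within one text type
sizes = {"a.json": 40, "b.json": 70, "c.json": 30}
picked, total = pick_sources(sizes, n=100)
print(picked, total)  # ['a.json', 'b.json'] 110
```

The total can overshoot `n` because files are never cut in the middle, which is why the gathered word counts are only approximately equal across text types.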

Initializing the model<!-- Mudeli ülesehitamine -->

In [49]:
def initialize_model(model_name, unique_labels, is_progress_bars=False):
    # Set up logging
    logger = logging.getLogger('simpletransformers.ner.ner_model')
    logger.setLevel(logging.ERROR)

    # Suppress specific warnings
    # warnings.filterwarnings("ignore", category=FutureWarning) # For warning message "FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated."
    warnings.filterwarnings("ignore", category=UserWarning) # For warnings like "UserWarning: <tag> seems not to be NE tag."

    # Configurations
    model_args = NERArgs()
    model_args.train_batch_size = 8
    model_args.evaluate_during_training = False
    model_args.learning_rate = 5e-5
    model_args.num_train_epochs = 10
    model_args.use_early_stopping = True
    model_args.save_eval_checkpoints = False
    model_args.save_model_every_epoch = False # Takes a lot of storage space
    model_args.save_steps = -1
    model_args.overwrite_output_dir = True
    model_args.cache_dir = model_name + '/cache'
    model_args.best_model_dir = model_name + '/best_model'
    model_args.output_dir = model_name
    model_args.use_multiprocessing = False
    model_args.silent = not is_progress_bars # silent=True hides the progress bars

    # Initialization
    # use_cuda is an argument of NERModel itself, not of NERArgs
    model = NERModel("camembert", model_name, args=model_args, labels=unique_labels, use_cuda=torch.cuda.is_available())  # Use GPU if available
    return model

Function for predicting tags for words

In [52]:
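def predict_tags(model, sentences):
    """Predicts tags for words.

    Args:
        model (): NER model that is predicted with
        sentences (str or list): Raw text, or a list of sentences where each sentence is a list of words

    Returns:
        tuple: Predictions and raw outputs of the model
    """
    if isinstance(sentences, str):
        # Split raw text into sentences; each sentence becomes a list of words
        text = estnltk.Text(sentences)
        text.tag_layer("morph_analysis")
        sentences = [s.text for s in text.sentences]
        predictions, raw_outputs = model.predict(sentences, split_on_space=False)

    elif isinstance(sentences, list) and sentences and isinstance(sentences[0], list):
        predictions, raw_outputs = model.predict(sentences, split_on_space=False)

    else:
        raise TypeError(f"Input is in wrong format. Possible formats are str or list of lists. Your input is {type(sentences)}")

    return predictions, raw_outputs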

Function for predicting the top n tags for words

In [54]:
def get_top_n_tags(raw_outputs, tag_list, n=3, with_confidence=True):
    """Extract the top <code>n</code> tags and their probabilities for each word based on the raw output logits.

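    Args:
        raw_outputs (list): Raw prediction logits from the model.
        tag_list (list or dict): All possible tags, indexable by label id.
        n (int): Number of top tags to extract.
        with_confidence (bool): Whether to return the probability together with each tag.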

    Returns:
        list: A list of lists, where each sublist contains tuples of the top `n` tags and their probabilities for a word.
    """
    top_n_tags = []

    for sentence_logits in raw_outputs:
        for word_logits in sentence_logits:

            # Get the logits for the word
            logits = list(word_logits.values())[0][0]

            # Apply softmax to get probabilities
            probabilities = torch.nn.functional.softmax(torch.tensor(logits), dim=-1).numpy()

            # Get the indices of the top `n` probabilities
            top_n_indices = np.argsort(probabilities)[-n:][::-1]

            # Map these indices to the actual tags and their probabilities
            if with_confidence:
                top_n_tags.append(
                    [(tag_list[i], probabilities[i]) for i in top_n_indices]
                )
            else:
                top_n_tags.append(
                    [(tag_list[i]) for i in top_n_indices]
                )

    return top_n_tags

def predict_top_n_tags(model, to_predict, with_confidence=True, n=3):
    """Predicts top <code>n</code> tags and their probabilities for each word.

    Args:
        model (): NER model that is predicted with
        to_predict (list): A list of sentences to predict on, where each sentence is a list of words (str).
        with_confidence (bool): Whether to output confidence values for each tag.
        n (int): Number of top tags to print for each word.
    """
    _, raw_outputs = model.predict(to_predict, split_on_space=False)

    tag_list = model.config.id2label  # Assuming the model has a mapping from ids to labels
    # get_top_n_tags returns one flat list over all sentences, so it is consumed with a single iterator;
    # re-zipping the full list for every sentence would pair later sentences with the tags of the first one
    top_n_tags_per_word = iter(get_top_n_tags(raw_outputs, tag_list, n=n, with_confidence=with_confidence))

    for sentence in to_predict:
        for word, tags in zip(sentence, top_n_tags_per_word):
            print(f"Word: {word}")
            if with_confidence:
                for tag, confidence in tags:
                    print(f"\tTag: {tag} \t Confidence: {confidence:.4f}")
            else:
                for tag in tags:
                    print(f"\tTag: {tag}")
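
The core of the top-n selection (softmax over the logits of one word, then picking the `n` most probable tags) can be sketched with the standard library only. The tag list and logit values below are made up for illustration, and `top_n` is a hypothetical helper, not a function of the notebook or of simpletransformers.

```python
import math

def top_n(logits, tags, n=3):
    """Return the n most probable (tag, probability) pairs for one word."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(tags[i], probs[i]) for i in order[:n]]

# Made-up tags and logits for a single word
tags = ["sg n_S", "b_V", "D", "Z"]
logits = [2.0, 0.5, 1.0, -1.0]
best = top_n(logits, tags, n=2)
for tag, prob in best:
    print(f"{tag}\t{prob:.4f}")
```

The highest logit always gives the highest probability, so the first returned tag is the same one the model would predict on its own.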

Functions for predicting labels for groups of sentences. A sentence that exceeds the model's maximum sequence length is split into clauses, and a clause that is still too long is split into parts of equal length.

In [None]:
def check_token_count(model, sentence):
    """Checks token count in the sentence

    Args:
        model (): NER model that is predicted with
        sentence (list): list of words in a sentence

    Returns:
        bool: Whether the token count exceeds model's maximum sequence length
    """

    inputs = model.tokenizer('  '.join(sentence), return_tensors="pt")
    return bool(len(inputs["input_ids"][0]) >= model.args.max_seq_length)

def get_clause_parts(model, clause):
    """Splits clause into equal length clause segments. <i>Clauses can get long when there is a list in a sentence</i>

    Args:
        model (): NER model that is predicted with
        clause (list): list of words in a clause

    Returns:
        list: List of clause segments. Each segment is a list of words.
    """

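    inputs = model.tokenizer('  '.join(clause), return_tensors="pt")
    n_parts = math.ceil(len(inputs["input_ids"][0]) / model.args.max_seq_length)
    # np.array_split returns NumPy arrays; convert them back to plain lists of words
    clause_parts = [part.tolist() for part in np.array_split(clause, n_parts)]
    return clause_parts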

def get_clauses(model, sentence):
    """Splits sentence into clauses using EstNLTK clauses layer.

    Args:
        model (): NER model that is predicted with
        sentence (list): list of words in a sentence

    Returns:
        list: List of clauses. Each clause is a list of words.
    """

    sentence_text = '  '.join(sentence)
    text = estnltk.Text(sentence_text)
    text.tag_layer('clauses')
    clauses = list()
    for clause in text.clauses:
        if check_token_count(model, clause.text):
            clause_parts = get_clause_parts(model, clause.text)
            clauses.extend(clause_parts)
        else:
            clauses.append(clause.text)
    return clauses

def get_clauses_labels(model, sentence):
    """Predicts labels to clauses

    Args:
        model (): NER model that is predicted with
        sentence (list): list of words in a sentence

    Returns:
        list: List of tags predicted to words in clauses
    """

    clauses = get_clauses(model, sentence)

    ner_labels_parts = list()

    # Predict tags
    predictions, raw_outputs = model.predict(clauses, split_on_space=False)
    for prediction_part in predictions:
        # ner_words_part = [list(p.keys())[0] for p in prediction_part]
        ner_labels_part = [list(p.values())[0] for p in prediction_part]

        # ner_words_parts.append(ner_words_part)
        ner_labels_parts.append(ner_labels_part)

    # ner_words = list(itertools.chain.from_iterable(ner_words_parts))
    ner_labels = list(itertools.chain.from_iterable(ner_labels_parts))

    return ner_labels

def get_sentence_labels(model, sentence):
    """Predicts labels to a sentence

    Args:
        model (): NER model that is predicted with
        sentence (list): list of words in a sentence

    Returns:
        list: List of tags predicted to words in a sentence
    """

    # Predict tags
    predictions, raw_outputs = model.predict([sentence], split_on_space=False)

    # ner_words = [list(p.keys())[0] for p in predictions[0]]
    ner_labels = [list(p.values())[0] for p in predictions[0]]
    return ner_labels

def process_groups(model, groups):
    """Predicts labels to a list of groups. This group contains sentences for each source text file.

    Args:
        model (): NER model that is predicted with
        groups (): Group containing sentences for each source text file

    Raises:
        AssertionError: When the number of predicted labels does not match the number of words in the sentence.
        <i>This might mean that the generated clauses are wrong (i.e. some words went missing because of <code>'  '.join(sentence)</code>)</i>

    Returns:
        list: List of predicted labels for each sentence in each source text file
    """
    chunk_results = []

    for _, group in groups:
        sentence = group.words.tolist()

        if check_token_count(model, sentence):  # Sentence splitting if token count is above model's max sequence length
            ner_labels = get_clauses_labels(model, sentence)
        else:
            ner_labels = get_sentence_labels(model, sentence)

        if len(ner_labels) != len(sentence):
            display(group)
            print(len(sentence), len(ner_labels))
            raise AssertionError("Predicted labels length mismatch sentence length.")

        group['ner_labels'] = ner_labels
        chunk_results.append(group)

    return chunk_results
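
The splitting rule in `get_clause_parts` divides the words of a clause into `ceil(token_count / max_seq_length)` parts of nearly equal length. Below is a minimal pure-Python sketch of that idea; `split_evenly` is a hypothetical stand-in for `np.array_split`, and the token count and length limit are assumed values.

```python
import math

def split_evenly(words, parts):
    """Split a list into `parts` consecutive chunks whose lengths differ by at most one."""
    base, extra = divmod(len(words), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        chunks.append(words[start:end])
        start = end
    return chunks

words = [f"w{i}" for i in range(10)]
token_count = 25      # hypothetical sub-word token count of the clause
max_seq_length = 10   # hypothetical model limit
parts = math.ceil(token_count / max_seq_length)
chunks = split_evenly(words, parts)
print([len(chunk) for chunk in chunks])  # [4, 3, 3]
```

Equal word counts do not guarantee that every part fits the limit, since words differ in how many sub-word tokens they produce; the length check in `process_groups` is what catches any resulting mismatch.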

#### Collecting and tokenizing data <!-- Andmete lugemine kogumine ja margendamine -->

<a id='andmete_lugemine_kogumine_ja_margendamine'></a>

Reads text from the files, tokenizes it and saves the results file by file. <!--Failidest teksti lugemine, märgendamine ja salvestamine failhaaval -->

Collecting `.json` files from directory <!--`json` failide kogumine kaustast-->

In [None]:
in_dir = '_plain_texts_json'

jsons = os.listdir(in_dir)

Data creation and cleaning

In [7]:
if not os.path.exists('./andmestik.csv'):
    create_df_enc2017(jsons, in_dir, 'andmestik.csv')

In [12]:
csv_file = "andmestik.csv"
df = pd.read_csv(csv_file, keep_default_na=False)
clean_df(df, csv_file)

Assigning NaN values in columns form and pos with an empty string
Removing NaN words
Modified dataframe saved to andmestik.csv


In [13]:
create_labels_column(df, csv_file)

Column 'labels' created


##### Verification <!-- Kontrollimine -->

<a id='kontrollimine_1'></a>

Ensuring that correct info was collected from the json file<!-- Kontrollitakse, et json failist said õiged infokillud kogutud -->

In [49]:
# For JSON file
text = estnltk.converters.json_to_text(file=os.path.join('_plain_texts_json', 'nc_255_27981.json'))  # os.path.join keeps the path portable
morph_analysis = text.tag_layer('morph_analysis')
ma_json = morph_analysis.sentences[0].morph_analysis
test_json = pd.DataFrame(data=list(zip(ma_json.text, ma_json.form, ma_json.partofspeech)), columns=("text", "form", "partofspeech"))
# display(test_json)

In [None]:
# For CSV file
lines = list()

with open(os.path.join('_plain_texts_csv', 'nc_255_27981.csv'), 'r') as f:  # os.path.join keeps the path portable
    csvFile = csv.reader(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for line in csvFile:
        if line:
            lines.append(line)

test_csv = pd.DataFrame(lines[1:], columns=lines[0])

# display(test_csv)

In [50]:
# For Pandas dataframe
test_csv = df[df['source'] == 'nc_255_27981.json']

Example of the first sentence in file `nc_255_27981`<!-- Näide esimesest lausest failis `nc_255_27981` -->

In [55]:
display(test_json[-15:])
display(test_csv[test_csv.sentence_id == 0][-15:])

| |text|form|partofspeech|
|---|---|---|---|
|15|samas|(sg in)|(P)|
|16|kestab|(b)|(V)|
|17|kohtuvaidlus|(sg n)|(S)|
|18|veel|()|(D)|
|19|30|(?)|(N)|
|20|FIEna|(sg es)|(Y)|
|21|ametis|(sg in)|(S)|
|22|olnud|(nud, , sg n, pl n)|(V, A, A, A)|
|23|töötaja|(sg g)|(S)|
|24|palga|(sg g)|(S)|


| |sentence_id|words|form|pos|type|source|
|---|---|---|---|---|---|---|
|4861762|0|samas|sg in|P|periodicals|nc_255_27981.json|
|4861763|0|kestab|b|V|periodicals|nc_255_27981.json|
|4861764|0|kohtuvaidlus|sg n|S|periodicals|nc_255_27981.json|
|4861765|0|veel||D|periodicals|nc_255_27981.json|
|4861766|0|30|?|N|periodicals|nc_255_27981.json|
|4861767|0|FIEna|sg es|Y|periodicals|nc_255_27981.json|
|4861768|0|ametis|sg in|S|periodicals|nc_255_27981.json|
|4861769|0|olnud|nud|V|periodicals|nc_255_27981.json|
|4861770|0|töötaja|sg g|S|periodicals|nc_255_27981.json|
|4861771|0|palga|sg g|S|periodicals|nc_255_27981.json|
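
The two outputs differ for `olnud`: the morphological analysis is ambiguous (four readings), while the dataset keeps only the first one, as done with `form[0]` and `pos[0]` in the collecting functions. A minimal sketch of that selection, with the analyses written out by hand from the output above:

```python
# (word, forms, parts of speech) as produced by an ambiguous morphological analysis
analyses = [
    ("olnud", ("nud", "", "sg n", "pl n"), ("V", "A", "A", "A")),
    ("töötaja", ("sg g",), ("S",)),
]

# Keep only the first reading of every word
rows = [(word, forms[0], pos[0]) for word, forms, pos in analyses]
print(rows)  # [('olnud', 'nud', 'V'), ('töötaja', 'sg g', 'S')]
```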


#### Statistics <!-- Statistika -->

<a id='statistika'></a>

In [16]:
# Reading CSV file
csv_file = "andmestik.csv"
df = pd.read_csv(csv_file, keep_default_na=False)

Token count in each text type<!-- Sõnade arv igal tekstiliigil -->

In [17]:
print("Word count")
display(df['type'].value_counts())
print(f"Sum: {df['type'].count()}")  # outer double quotes: reusing the same quote inside an f-string needs Python 3.12+

Word count


type
fiction             2170692
science             2121583
periodicals         2099374
wikipedia           2007179
blogs_and_forums    1991307
Name: count, dtype: int64

Sum: 10390135


Sentence count in each text type<!-- Lausete arv igal tekstiliigil -->

In [18]:
text_types = df['type'].unique().tolist()
print("Sentence count")
for text_type in text_types:
    sentence_grouped = df[df['type'] == text_type].groupby(['source', 'sentence_id'])
    sentence_count = sentence_grouped.ngroups
    print(f"{text_type}: {sentence_count}")
sentence_grouped = df.groupby(['source', 'sentence_id'])
sentence_count = sentence_grouped.ngroups
print(f"\nSum: {sentence_count}")

Sentence count
periodicals: 144979
fiction: 165249
science: 129810
blogs_and_forums: 159826
wikipedia: 136377

Sum: 736241
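
Sentences are counted by grouping on both `source` and `sentence_id` because `sentence_id` restarts from 0 in every source file (see `create_df_enc2017`). A toy illustration in plain Python, with made-up rows:

```python
# One (source, sentence_id) pair per word; sentence ids restart in every file
rows = [
    ("a.json", 0), ("a.json", 0), ("a.json", 1),
    ("b.json", 0), ("b.json", 0),
]

distinct_ids = len({sentence_id for _, sentence_id in rows})
sentences = len(set(rows))
print(distinct_ids)  # 2, undercounts: sentence 0 of both files collapses into one
print(sentences)     # 3, the actual number of sentences
```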


File count in each text type<!-- Failide arv tekstiliigiti -->

In [19]:
print("File count")
for text_type in text_types:
    print(f"{text_type}: {len(df['source'][df['type'] == text_type].value_counts().tolist())}")
print(f"\nSum: {len(df['source'].value_counts().tolist())}")

File count
periodicals: 5917
fiction: 53
science: 229
blogs_and_forums: 3016
wikipedia: 9270

Sum: 18485


Outputs word count for each source (i.e. text file) present in the data

In [None]:
def check_each_source_word_count(data):
    """Outputs word count for each source (i.e. text file) present in the data

    Args:
        data (pandas.core.frame.DataFrame): Pandas dataframe
    """
    for text_type in text_types:
        print(f"{text_type}, sources: {len(data['source'][data['type'] == text_type].value_counts().tolist())}")
        display(data['source'][data['type'] == text_type].value_counts().head(5))
        if len(data['source'][data['type'] == text_type].value_counts().tolist()) > 5:
            display(data['source'][data['type'] == text_type].value_counts().tail(5))

In [None]:
# check_each_source_word_count(df)

#### Data splitting<!-- ### Andmete jagamine treening-, test- ja valideerimishulkadesse -->

<a id='andmete_jagamine_treening_test_ja_valideerimishulkadesse'></a>

The amount of training data was chosen based on the [statistics](https://github.com/UniversalDependencies/UD_Estonian-EDT/blob/master/stats.xml) of the [Estonian UD EDT treebank](https://github.com/UniversalDependencies/UD_Estonian-EDT). The data is gathered so that every text type contributes roughly the same number of words.

|Text type|Word count|
|---|---|
|Fiction|             115847|
|Blogs and forums|    115719|
|Science|             115347|
|Periodicals|         115050|
|Wikipedia|           115014|
|**Sum**|             576977|

The tag that the model learns to predict is the concatenation of the `form` and `pos` columns, joined with an underscore (e.g. `form` `s` and `pos` `V` give `s_V`). When `form` is empty, the tag is the `pos` value alone (e.g. `Z`). These tags are stored in the `labels` column.
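
The label construction can be sketched as follows. The helper name `make_label` is only for illustration (the notebook builds the column inside `get_unique_labels`, defined elsewhere), and the rule for empty `form` values is inferred from labels such as `Z` and `D` in the data:

```python
def make_label(form: str, pos: str) -> str:
    """Join form and part of speech with an underscore; use pos alone when form is empty."""
    return f"{form}_{pos}" if form else pos


# (form, pos) pairs as they occur in the dataset
examples = [("s", "V"), ("sg n", "S"), ("", "Z"), ("?", "N")]
labels = [make_label(form, pos) for form, pos in examples]

print(labels)  # ['s_V', 'sg n_S', 'Z', '?_N']
```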

##### Grouping data by text type <!-- #### Andmete grupeerimine tekstiliigiti -->

<a id='andmete_grupeerimine_tekstiliigiti'></a>

In [11]:
# Reading CSV file
csv_file = "andmestik.csv"
df = pd.read_csv(csv_file, keep_default_na=False)

```
#pos_labels = ['A', 'C', 'D', 'G', 'H', 'I', 'J', 'K', 'N', 'O', 'P', 'S', 'U', 'V', 'X', 'Y', 'Z']
#form_labels = ['ab', 'abl', 'ad', 'adt', 'all', 'el', 'es', 'g', 'ill', 'in', 'kom', 'n', 'p', 'pl', 'sg', 'ter', 'tr', 'b', 'd', 'da', 'des', 'ge', 'gem', 'gu', 'ks', 'ksid', 'ksime', 'ksin', 'ksite', 'ma', 'maks', 'mas', 'mast', 'mata', 'me', 'n', 'neg', 'neg ge', 'neg gem', 'neg gu', 'neg ks', 'neg me', 'neg nud', 'neg nuks', 'neg o', 'neg vat', 'neg tud', 'nud', 'nuks', 'nuksid', 'nuksime', 'nuksin', 'nuksite', 'nuvat', 'o', 's', 'sid', 'sime', 'sin', 'site', 'ta', 'tagu', 'taks', 'takse', 'tama', 'tav', 'tavat', 'te', 'ti', 'tud', 'tuks', 'tuvat', 'v', 'vad', 'vat']
```

Labels that have '?' in them

In [12]:
question_mark_labels = df[df['labels'].str.contains(r"\?")]
question_mark_labels['labels'].unique()

array(['?_Y', '?_N', '?_O', '?_H', '?_S', '?_A', '?_Z'], dtype=object)

In [48]:
unique_labels = get_unique_labels()
# display(unique_labels)

Creating column 'labels'
Column 'labels' created
List of unique labels created


Saving unique labels for future use

In [16]:
if not os.path.exists("./unique_labels.json"):
    with open("unique_labels.json", 'w') as f:
        json.dump(unique_labels, f, indent=2)

In [None]:
data = gather_rows_for_text_type(df, 115000) #115000 words per text type for NER model
display(data)

In [40]:
# Result
display(data['type'].value_counts())
print(f"Sum: {data['words'].size}")

type
fiction             115847
blogs_and_forums    115719
science             115347
periodicals         115050
wikipedia           115014
Name: count, dtype: int64

Sum: 576977


Collecting the text types for evaluating the model by text type

In [83]:
text_types = data['type'].unique().tolist()

Saving model data to CSV file

In [None]:
data.to_csv(path_or_buf='model_data.csv', index=False)

##### Splitting data to sets<!-- #### Andmete jagamine hulkadesse -->

<a id='andmete_jagamine_hulkadesse'></a>

In [226]:
# Reading CSV file
data = pd.read_csv("model_data.csv", keep_default_na=False)

Grouping data by filename to preserve the integrity of texts<!-- Andmete grupeerimine failinimede kaupa, et säilitada tekstide terviklikkus -->

In [228]:
grouped = data.groupby('source')

groups = list(grouped.groups.keys())
train_groups, test_groups = sk.model_selection.train_test_split(groups, test_size=0.2, random_state=42)

def filter_by_group(df, groups):
    return df[df['source'].isin(groups)]

# Splitting dataframe
train_df = filter_by_group(data, train_groups)
test_df = filter_by_group(data, test_groups)
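
Splitting on file names rather than on rows keeps every text entirely in one set. The idea can be sketched with the standard library alone; `split_by_source` is a hypothetical helper and does not reproduce scikit-learn's exact shuffling:

```python
import random


def split_by_source(rows, test_size=0.2, seed=42):
    """Split rows into train/test so that every source lands in exactly one set."""
    sources = sorted({row["source"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(sources)
    n_test = max(1, round(len(sources) * test_size))
    test_sources = set(sources[:n_test])
    train = [r for r in rows if r["source"] not in test_sources]
    test = [r for r in rows if r["source"] in test_sources]
    return train, test


# Made-up data: 20 tokens spread evenly over 5 files
rows = [{"source": f"file_{i % 5}.json", "words": f"w{i}"} for i in range(20)]
train_rows, test_rows = split_by_source(rows)

train_sources = {r["source"] for r in train_rows}
test_sources = {r["source"] for r in test_rows}
print(train_sources & test_sources)  # set()
```

Because whole files are assigned to a set, the share of tokens in the test set can differ from 20% when files vary in length.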

Removing unnecessary columns for the model<!-- Mudelile ebavajalike veergude eemaldamine -->

In [56]:
train_df = train_df.drop(labels=['type', 'source'], axis=1)
test_df = test_df.drop(labels=['type', 'source'], axis=1)
# display(train_df)
# display(test_df)

#### Token classification model (for assigning morphological categories)<!-- ### NER Mudel -->

<a id='ner_mudel'></a>

Initializing the model<!-- Mudeli ülesehitamine -->

In [259]:
# Set up logging
logger = logging.getLogger('simpletransformers.ner.ner_model')
logger.setLevel(logging.ERROR)

# Suppress specific warnings
# warnings.filterwarnings("ignore", category=FutureWarning) # For warning message "FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated."
warnings.filterwarnings("ignore", category=UserWarning) # For warnings from seqeval.metrics like "UserWarning: <tag> seems not to be NE tag."

# Configurations
model_args = NERArgs()
model_args.train_batch_size = 8
model_args.evaluate_during_training = False
model_args.learning_rate = 5e-5
model_args.num_train_epochs = 10
model_args.use_early_stopping = True
model_args.use_cuda = torch.cuda.is_available()  # Use GPU if available
model_args.save_eval_checkpoints = False
model_args.save_model_every_epoch = False # Takes a lot of storage space
model_args.save_steps = -1
model_args.overwrite_output_dir = True
model_args.cache_dir = 'NER_mudel/cache'
model_args.best_model_dir = 'NER_mudel/best_model'
model_args.output_dir = 'NER_mudel'
model_args.use_multiprocessing = False
model_args.silent = True

# Initialization
model = NERModel("camembert", "EMBEDDIA/est-roberta", args=model_args, labels=unique_labels)

Training the model<!-- Mudeli treenimine -->

In [260]:
if not os.path.exists(os.path.join("NER_mudel", "config.json")):  # portable path instead of Windows-only separators
    # Train model
    print("Training model")
    model.train_model(train_df, eval_data=test_df)
else:
    model = NERModel("camembert", "NER_mudel", args=model_args, labels=unique_labels)

Evaluating the model<!-- Mudeli hindamine -->

In [98]:
# Evaluate the model
result, model_outputs, preds_list = model.eval_model(test_df)
print(result)

INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:50<00:00, 50.58s/it]
  with amp.autocast():
Running Evaluation: 100%|██████████| 4/4 [00:00<00:00,  5.77it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.24354228004813194, 'precision': 0.9536447592575988, 'recall': 0.9518951089631718, 'f1_score': 0.95276913085347}
{'eval_loss': 0.24354228004813194, 'precision': 0.9536447592575988, 'recall': 0.9518951089631718, 'f1_score': 0.95276913085347}


In [None]:
print(f"Evaluation Loss:{result['eval_loss']:.4f}")
print(f"Precision: \t{result['precision']:.4f}")
print(f"Recall: \t{result['recall']:.4f}")
print(f"F1 Score: \t{result['f1_score']:.4f}")
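
The reported F1 score is the harmonic mean of precision and recall, which can be checked against the values in the evaluation output. A small sketch (the helper name `f1_from` is only for illustration):

```python
import math


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above
precision = 0.9536447592575988
recall = 0.9518951089631718
reported_f1 = 0.95276913085347

f1 = f1_from(precision, recall)
print(f"{f1:.4f}")  # 0.9528
print(math.isclose(f1, reported_f1, abs_tol=1e-6))
```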

Without the warning filter set above, the output contains many `UserWarning`s from `seqeval.metrics` (`<tag> seems not to be NE tag.`). These are harmless: the model can still predict these tags.
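
If silencing every `UserWarning` is too broad, the filter can be narrowed to the seqeval message quoted in the configuration cell. A sketch using only the standard library; the two demo warnings are emitted by hand to stand in for seqeval and for an unrelated warning:

```python
import warnings

# Regular expression matched against the start of the warning text
NE_TAG_PATTERN = r".* seems not to be NE tag\."


def emit_demo_warnings():
    warnings.warn("sg n_S seems not to be NE tag.", UserWarning)
    warnings.warn("something else went wrong", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only the matching message; other UserWarnings are still recorded
    warnings.filterwarnings("ignore", message=NE_TAG_PATTERN, category=UserWarning)
    emit_demo_warnings()

messages = [str(w.message) for w in caught]
print(messages)  # ['something else went wrong']
```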

Evaluating the model by text type<!-- Mudeli hindamine tekstiliigiti -->

In [None]:
for text_type in text_types:

    print(f"Text type:\t{text_type}")

    text_type_data = data.loc[data['type'] == text_type]

    print(f"Token count: {len(text_type_data)}")

    print(f"Starting evaluation")

    # Evaluate the model
    result, model_outputs, preds_list = model.eval_model(text_type_data)
    print(result)
    print(f"Evaluation Loss:{result['eval_loss']:.4f}")
    print(f"Precision: \t{result['precision']:.4f}")
    print(f"Recall: \t{result['recall']:.4f}")
    print(f"F1 Score: \t{result['f1_score']:.4f}")

Making predictions with the model <!-- Mudelil ennustamine -->

In [235]:
predictions, raw_outputs = model.predict([['A. H. Tammsaare',
 'oli',
 'eesti',
 'kirjanik',
 ',',
 'esseist',
 ',',
 'kultuurifilosoof',
 'ja',
 'tõlkija',
 '.']], split_on_space=False)

print(predictions)

[[{'A. H. Tammsaare': 'sg n_H'}, {'oli': 's_V'}, {'eesti': 'G'}, {'kirjanik': 'sg n_S'}, {',': 'Z'}, {'esseist': 'sg n_S'}, {',': 'Z'}, {'kultuurifilosoof': 'sg n_S'}, {'ja': 'J'}, {'tõlkija': 'sg n_S'}, {'.': 'Z'}]]


In [234]:
test_text = estnltk.Text("A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija. Üksnes autorihüvitis oli 12 431 krooni.")
test_text.tag_layer("morph_analysis")
test_text.morph_analysis;
sentences = [s.text for s in test_text.sentences]

# Predict tags
predictions, raw_outputs = model.predict(sentences, split_on_space=False)

# Output of predictions
print(predictions)

[[{'A. H. Tammsaare': 'sg n_H'}, {'oli': 's_V'}, {'eesti': 'G'}, {'kirjanik': 'sg n_S'}, {',': 'Z'}, {'esseist': 'sg n_S'}, {',': 'Z'}, {'kultuurifilosoof': 'sg n_S'}, {'ja': 'J'}, {'tõlkija': 'sg n_S'}, {'.': 'Z'}], [{'Üksnes': 'D'}, {'autorihüvitis': 'sg n_S'}, {'oli': 's_V'}, {'12 431': '?_N'}, {'krooni': 'sg p_S'}, {'.': 'Z'}]]


#### Top n tags prediction<!-- ### Top n märgendi ennustamine -->

<a id='top_n_margendi_ennustamine'></a>

In [None]:
# Top n tags prediction for tokens
# # Determine the device
# device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# # Ensure model is on the right device
# model.model.to(device)

# # Sample input sentence
# sentence = "A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."

# # Tokenize and predict (This is normally done internally by simpletransformers)
# inputs = model.tokenizer(sentence, return_tensors="pt")
# inputs = {key: value.to(model.device) for key, value in inputs.items()}

# # Get the logits from the model
# model_output = model.model(**inputs)
# logits = model_output.logits  # Ensure logits are on the same device

# # Apply softmax to logits to get probabilities
# probs = torch.nn.functional.softmax(logits, dim=-1)

# # Get top n tags for each token
# top_n = 3
# top_n_probs, top_n_indices = torch.topk(probs, top_n, dim=-1)

# # Decode the tags
# for token_idx in range(len(inputs["input_ids"][0])):
#     token = model.tokenizer.decode([inputs["input_ids"][0][token_idx]])
#     print(f"Token: {token}")
#     for i in range(top_n):
#         label_idx = top_n_indices[0][token_idx][i].item()
#         label_prob = top_n_probs[0][token_idx][i].item()
#         label = model.args.labels_list[label_idx]
#         print(f"  Tag {i+1}: {label} (Probability: {label_prob:.4f})")

Token: <s>
  Tag 1: Z (Probability: 0.9998)
  Tag 2: J (Probability: 0.0000)
  Tag 3: ?_Y (Probability: 0.0000)
Token: A
  Tag 1: sg n_H (Probability: 0.9954)
  Tag 2: sg g_H (Probability: 0.0024)
  Tag 3: ?_Y (Probability: 0.0004)
Token: .
  Tag 1: Z (Probability: 0.9995)
  Tag 2: ?_Y (Probability: 0.0001)
  Tag 3: J (Probability: 0.0001)
Token: H
  Tag 1: sg n_H (Probability: 0.9968)
  Tag 2: ?_Y (Probability: 0.0010)
  Tag 3: sg g_H (Probability: 0.0004)
Token: .
  Tag 1: Z (Probability: 0.9993)
  Tag 2: J (Probability: 0.0002)
  Tag 3: ?_Y (Probability: 0.0001)
Token: Tammsaare
  Tag 1: sg n_H (Probability: 0.9994)
  Tag 2: sg g_H (Probability: 0.0002)
  Tag 3: sg n_S (Probability: 0.0001)
Token: oli
  Tag 1: s_V (Probability: 0.9999)
  Tag 2: sg n_S (Probability: 0.0000)
  Tag 3: A (Probability: 0.0000)
Token: eesti
  Tag 1: G (Probability: 0.9958)
  Tag 2: sg g_S (Probability: 0.0004)
  Tag 3: sg g_H (Probability: 0.0003)
Token: kirjanik
  Tag 1: sg n_S (Probability: 0.9997)
  Ta
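
The commented-out cell applies a softmax to the logits and keeps the three most probable labels for each subword token. The same two steps in plain Python, with made-up logits and a four-label set chosen only for illustration:

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_n(labels, logits, n=3):
    """Return the n most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical logits for one token
labels = ["sg n_H", "sg g_H", "?_Y", "Z"]
logits = [9.1, 3.0, 1.2, -2.0]

result = top_n(labels, logits)
for label, prob in result:
    print(f"{label}: {prob:.4f}")
```

Note that the output above lists subword tokens (`A`, `.`, `H`), whereas `predict_top_n_tags` reports one entry per word.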

In [None]:
predict_top_n_tags(model, [['A. H. Tammsaare',
 'oli',
 'eesti',
 'kirjanik',
 ',',
 'esseist',
 ',',
 'kultuurifilosoof',
 'ja',
 'tõlkija',
 '.']])

Word: A. H. Tammsaare
	Tag: sg n_H 	 Confidence: 0.9976
	Tag: sg g_H 	 Confidence: 0.0012
	Tag: ?_Y 	 Confidence: 0.0001
Word: oli
	Tag: s_V 	 Confidence: 1.0000
	Tag: sg n_S 	 Confidence: 0.0000
	Tag: A 	 Confidence: 0.0000
Word: eesti
	Tag: G 	 Confidence: 0.9956
	Tag: sg g_S 	 Confidence: 0.0004
	Tag: sg g_H 	 Confidence: 0.0002
Word: kirjanik
	Tag: sg n_S 	 Confidence: 0.9995
	Tag: sg in_S 	 Confidence: 0.0000
	Tag: sg g_S 	 Confidence: 0.0000
Word: ,
	Tag: Z 	 Confidence: 1.0000
	Tag: J 	 Confidence: 0.0000
	Tag: A 	 Confidence: 0.0000
Word: esseist
	Tag: sg n_S 	 Confidence: 0.9995
	Tag: sg n_A 	 Confidence: 0.0001
	Tag: sg in_S 	 Confidence: 0.0000
Word: ,
	Tag: Z 	 Confidence: 1.0000
	Tag: J 	 Confidence: 0.0000
	Tag: s_V 	 Confidence: 0.0000
Word: kultuurifilosoof
	Tag: sg n_S 	 Confidence: 0.9995
	Tag: sg n_A 	 Confidence: 0.0000
	Tag: sg in_S 	 Confidence: 0.0000
Word: ja
	Tag: J 	 Confidence: 1.0000
	Tag: Z 	 Confidence: 0.0000
	Tag: o_V 	 Confidence: 0.0000
Word: tõlkija
	

#### Token classification model (for assigning morphological categories) by text type<!-- ### NER mudel tekstiliigiti -->

<a id='ner_mudel_tekstiliigiti'></a>

In [25]:
for text_type in text_types:

    print(f"Text type:\t{text_type}")

    text_type_data = data.loc[data['type'] == text_type]

    print(f"Token count: {len(text_type_data)}")

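    # Group only this text type's rows by file so that texts stay intact
    grouped = text_type_data.groupby('source')

    groups = list(grouped.groups.keys())
    train_groups, test_groups = sk.model_selection.train_test_split(groups, test_size=0.2, random_state=42)

    # Split this text type's rows into train and test sets
    train_df = text_type_data[text_type_data['source'].isin(train_groups)]
    test_df = text_type_data[text_type_data['source'].isin(test_groups)]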

    print("Train and test sets created.\nBeginning to train")

    # Model

    logging.basicConfig(level=logging.INFO)
    transformers_logger = logging.getLogger("transformers")
    transformers_logger.setLevel(logging.WARNING)

    # Configure the model
    model_args = NERArgs()
    model_args.train_batch_size = 8
    model_args.evaluate_during_training = False
    model_args.learning_rate = 5e-5
    model_args.num_train_epochs = 10
    model_args.use_early_stopping = True
    model_args.use_cuda = torch.cuda.is_available()  # Use GPU if available
    model_args.overwrite_output_dir = True
    model_args.save_eval_checkpoints = False
    model_args.save_model_every_epoch = False
    model_args.save_steps = -1
    model_args.cache_dir = text_type + '_NER_mudel/cache'
    model_args.best_model_dir = text_type + '_NER_mudel/best_model'
    model_args.output_dir = text_type + '_NER_mudel'

    # Initialize the model
    model = NERModel("camembert", "EMBEDDIA/est-roberta", args=model_args, labels=unique_labels)

    # Train the model
    model.train_model(train_df, eval_data=test_df)

    print(f"Starting evaluation")

    # Evaluate the model
    result, model_outputs, preds_list = model.eval_model(text_type_data)
    print(result)
    print(f"Evaluation Loss:{result['eval_loss']:.4f}")
    print(f"Precision: \t{result['precision']:.4f}")
    print(f"Recall: \t{result['recall']:.4f}")
    print(f"F1 Score: \t{result['f1_score']:.4f}")

Text type:	blogs_and_forums
Size: 115719 words
Starting evaluation
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 2/2 [00:08<00:00,  4.49s/it]
  with amp.autocast():
Running Evaluation: 100%|██████████| 9/9 [00:01<00:00,  6.79it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.08652005354977316, 'precision': 0.9740840914679124, 'recall': 0.9738446411012782, 'f1_score': 0.9739643515673017}
{'eval_loss': 0.08652005354977316, 'precision': 0.9740840914679124, 'recall': 0.9738446411012782, 'f1_score': 0.9739643515673017}
Text type:	fiction
Size: 115847 words
Starting evaluation
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 15/15 [00:18<00:00,  1.24s/it]
Running Evaluation: 100%|██████████| 75/75 [00:12<00:00,  6.13it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.0043835937487892804, 'precision': 0.9991472669907052, 'recall': 0.9991046685141761, 'f1_score': 0.9991259672983863}
{'eval_loss': 0.0043835937487892804, 'precision': 0.9991472669907052, 'recall': 0.9991046685141761, 'f1_score': 0.9991259672983863}
Text type:	periodicals
Size: 115050 words
Starting evaluation
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:09<00:00,  9.00s/it]
Running Evaluation: 100%|██████████| 3/3 [00:00<00:00,  6.96it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.11051202068726222, 'precision': 0.9810868635002139, 'recall': 0.9799128130609454, 'f1_score': 0.9804994868286008}
{'eval_loss': 0.11051202068726222, 'precision': 0.9810868635002139, 'recall': 0.9799128130609454, 'f1_score': 0.9804994868286008}
Text type:	science
Size: 115347 words
Starting evaluation
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:08<00:00,  8.86s/it]
Running Evaluation: 100%|██████████| 4/4 [00:00<00:00,  6.16it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.1578108500689268, 'precision': 0.9701866365031914, 'recall': 0.971324570273003, 'f1_score': 0.9707552699124881}
{'eval_loss': 0.1578108500689268, 'precision': 0.9701866365031914, 'recall': 0.971324570273003, 'f1_score': 0.9707552699124881}
Text type:	wikipedia
Size: 115014 words
Starting evaluation
INFO:ner_model.py:1884:  Converting to features started.


100%|██████████| 1/1 [00:09<00:00,  9.05s/it]
Running Evaluation: 100%|██████████| 4/4 [00:00<00:00,  8.53it/s]


INFO:ner_model.py:1253: {'eval_loss': 0.32173459976911545, 'precision': 0.9479583095374072, 'recall': 0.9465393114263312, 'f1_score': 0.9472482790598138}
{'eval_loss': 0.32173459976911545, 'precision': 0.9479583095374072, 'recall': 0.9465393114263312, 'f1_score': 0.9472482790598138}


Newly trained and evaluated NER models (one per text type)

| Text type         | Token count | eval_loss            | precision          | recall             | f1_score           |
|-------------------|-------------|----------------------|--------------------|--------------------|--------------------|
| blogs_and_forums  | 115719      | 0.23976334929466248  | 0.9538220130015691 | 0.9520293551707164 | 0.9529248409925647 |
| fiction           | 115847      | 0.2396537810564041   | 0.9538792523867151 | 0.952342596321654  | 0.9531103049845493 |
| periodicals       | 115050      | 0.24317898601293564  | 0.9550244383659926 | 0.9530585760952253 | 0.9540404945350295 |
| science           | 115347      | 0.2376849427819252   | 0.9557577659240665 | 0.9541325457555824 | 0.9549444643496955 |
| wikipedia         | 115014      | 0.23429438471794128  | 0.9547409930094999 | 0.953416565982011  | 0.9540783198620781 |

General NER model evaluated by text type

| Text type         | Token count | eval_loss            | precision          | recall             | f1_score           |
|-------------------|-------------|----------------------|--------------------|--------------------|--------------------|
| blogs_and_forums  | 115719      | 0.08935269061475992  | 0.974046101995129  | 0.9733038348082595 | 0.9736748269375007 |
| fiction           | 115847      | 0.004841572717996314 | 0.9990726034260374 | 0.9989767640162013 | 0.9990246814225794 |
| periodicals       | 115050      | 0.09677119304736455  | 0.9807560725282244 | 0.9801692452346354 | 0.9804625710743448 |
| science           | 115347      | 0.1642883848398924   | 0.9677601809954751 | 0.9688169868554095 | 0.9682882955716798 |
| wikipedia         | 115014      | 0.2906533405184746   | 0.9479240975888144 | 0.9471808396892152 | 0.9475523228865832 |

#### Gathering unused data and comparing Vabamorf / NER model <!-- Kasutamata andmete kogumine ja Vabamorfi ning NER mudeli võrdlemine -->

<a id='kasutamata_andmete_kogumine'></a>

In [None]:
# Reading CSV file
df = pd.read_csv("andmestik.csv", keep_default_na=False)
model_df = pd.read_csv("model_data.csv", keep_default_na=False)

In [26]:
# Ensure both DataFrames have the same columns for comparison
common_columns = df.columns.intersection(model_df.columns)
# Perform a left anti-join to get the unused data
unused_data = df.merge(model_df, on=common_columns.tolist(), how='left', indicator=True)
unused_data = unused_data[unused_data['_merge'] == 'left_only'].drop(columns=['_merge'])
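
The merge with `indicator=True` followed by the `left_only` filter is an anti-join: it keeps the rows of the full dataset that have no exact match in the model data. The same idea in plain Python, with made-up rows standing in for the two DataFrames:

```python
# Hypothetical rows: (source, sentence_id, word)
all_rows = [
    ("a.json", 0, "Tere"),
    ("a.json", 0, "!"),
    ("b.json", 0, "Jah"),
    ("b.json", 0, "."),
]
model_rows = [
    ("a.json", 0, "Tere"),
    ("a.json", 0, "!"),
]

# Anti-join: keep rows of all_rows that have no exact match in model_rows
used = set(model_rows)
unused_rows = [row for row in all_rows if row not in used]

print(unused_rows)  # [('b.json', 0, 'Jah'), ('b.json', 0, '.')]
```

In the pandas version, repeated identical rows (for example several commas in one sentence) only multiply the rows marked `both`, which are discarded, so the `left_only` remainder is unaffected.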

In [28]:
# Print or save the result
display(unused_data)
if not os.path.exists("./unused_data.csv"):
    unused_data.to_csv("unused_data.csv", index=False)

Unnamed: 0,sentence_id,words,form,pos,type,source,labels
131164,0,BAGDAD,sg n,H,periodicals,nc_10532_642042.json,sg n_H
131165,0,",",,Z,periodicals,nc_10532_642042.json,Z
131166,0,29.,?,O,periodicals,nc_10532_642042.json,?_O
131167,0,november,sg n,S,periodicals,nc_10532_642042.json,sg n_S
131168,0,(,,Z,periodicals,nc_10532_642042.json,Z
...,...,...,...,...,...,...,...
10555188,2,tulemusena,sg es,S,wikipedia,wiki17_99964_x.json,sg es_S
10555189,2,[,,Z,wikipedia,wiki17_99964_x.json,Z
10555190,2,2,?,N,wikipedia,wiki17_99964_x.json,?_N
10555191,2,],,Z,wikipedia,wiki17_99964_x.json,Z


In [17]:
unused_data = pd.read_csv("unused_data.csv", keep_default_na=False)

In [16]:
comparison_data = gather_rows_for_text_type(unused_data, 600000, 42)

In [17]:
if not os.path.exists("./comparison_data.csv"):
    comparison_data.to_csv("comparison_data.csv", index=False)

In [21]:
comparison_data = pd.read_csv("comparison_data.csv", keep_default_na=False)

In [22]:
comparison_data

Unnamed: 0,sentence_id,words,form,pos,type,source,labels
0,0,Koju,adt,S,blogs_and_forums,web13_274106_x.json,adt_S
1,0,jõudes,des,V,blogs_and_forums,web13_274106_x.json,des_V
2,0,ootas,s,V,blogs_and_forums,web13_274106_x.json,s_V
3,0,ees,,D,blogs_and_forums,web13_274106_x.json,D
4,0,üllatus,sg n,S,blogs_and_forums,web13_274106_x.json,sg n_S
...,...,...,...,...,...,...,...
3056351,17,aastaks,sg tr,S,wikipedia,wiki17_85786_x.json,sg tr_S
3056352,17,oli,s,V,wikipedia,wiki17_85786_x.json,s_V
3056353,17,ta,sg n,P,wikipedia,wiki17_85786_x.json,sg n_P
3056354,17,surnud,,A,wikipedia,wiki17_85786_x.json,A


In [23]:
comparison_groups = comparison_data.groupby(['source', 'sentence_id'])

In [24]:
with open("unique_labels.json", 'r') as f:
    unique_labels = json.load(f)

model = initialize_model("NER_mudel", unique_labels)

In [26]:
predicted_chunks = process_groups(model, comparison_groups)

In [27]:
updated_comparison_data = pd.concat(predicted_chunks)

In [29]:
updated_comparison_data.to_csv("updated_comparison_data.csv", index=False)

In [30]:
updated_comparison_data = pd.read_csv('./updated_comparison_data.csv', keep_default_na=False)

In [32]:
label_differences = updated_comparison_data[updated_comparison_data['labels'] != updated_comparison_data['ner_labels']]

In [56]:
display(label_differences)

Unnamed: 0,sentence_id,words,form,pos,type,source,labels,ner_labels
0,0,OKLAHOMA,sg g,H,periodicals,nc_10532_642051.json,sg g_H,sg n_H
5,0,Reuters-EPLO,?,Y,periodicals,nc_10532_642051.json,?_Y,sg n_H
8,0,Oklahoma,sg g,H,periodicals,nc_10532_642051.json,sg g_H,sg n_H
47,1,sajalases,sg in,S,periodicals,nc_10532_642051.json,sg in_S,sg in_A
61,2,Timothy,sg g,H,periodicals,nc_10532_642051.json,sg g_H,sg n_H
...,...,...,...,...,...,...,...,...
3056133,20,maini,pl ter,S,wikipedia,wiki17_99902_x.json,pl ter_S,sg ter_S
3056235,6,Kloogaranna,sg n,H,wikipedia,wiki17_99953_x.json,sg n_H,sg g_H
3056247,7,ehitatud,tud,V,wikipedia,wiki17_99953_x.json,tud_V,A
3056303,9,teise,adt,P,wikipedia,wiki17_99953_x.json,adt_P,sg g_P


In [61]:
print(f"Differences count: {len(label_differences)} / {len(updated_comparison_data)} ({round(len(label_differences) / len(updated_comparison_data) * 100, 3)}%)")

Differences count: 139917 / 3056356 (4.578%)
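
Beyond the overall rate, it can help to see which label pairs disagree most often. A sketch with `collections.Counter`, using only the nine rows displayed above (so the counts are illustrative, not the real distribution):

```python
from collections import Counter

# (labels, ner_labels) pairs copied from the displayed rows
pairs = [
    ("sg g_H", "sg n_H"),
    ("?_Y", "sg n_H"),
    ("sg g_H", "sg n_H"),
    ("sg in_S", "sg in_A"),
    ("sg g_H", "sg n_H"),
    ("pl ter_S", "sg ter_S"),
    ("sg n_H", "sg g_H"),
    ("tud_V", "A"),
    ("adt_P", "sg g_P"),
]

confusions = Counter(pairs)
for (reference, predicted), count in confusions.most_common(3):
    print(f"{reference} -> {predicted}: {count}")

# Share of differing tokens, from the totals printed above
rate = 139917 / 3056356 * 100
print(f"{rate:.3f}%")  # 4.578%
```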


### Testing *(will be removed afterwards)*

In [73]:
test_text = estnltk.Text("A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija. Üksnes autorihüvitis oli 12 431 krooni.")
test_text.tag_layer("morph_analysis")
test_text.morph_analysis;

In [74]:
sentences = [s.text for s in test_text.sentences]

In [325]:
# Predict tags
predictions, raw_outputs = model.predict(sentences, split_on_space=False)

# Output of predictions
print(predictions)

INFO:ner_model.py:1884:  Converting to features started.
[[{'A. H. Tammsaare': 'sg n_H'}, {'oli': 's_V'}, {'eesti': 'G'}, {'kirjanik': 'sg n_S'}, {',': 'Z'}, {'esseist': 'sg n_S'}, {',': 'Z'}, {'kultuurifilosoof': 'sg n_S'}, {'ja': 'J'}, {'tõlkija': 'sg n_S'}, {'.': 'Z'}], [{'Üksnes': 'D'}, {'autorihüvitis': 'sg n_S'}, {'oli': 's_V'}, {'12 431': '?_N'}, {'krooni': 'sg p_S'}, {'.': 'Z'}]]


In [326]:
display(test_text.morph_analysis)

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,17

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
A. H. Tammsaare,A. H. Tammsaare,A. H. Tammsaare,A. _H. _Tamm_saare,"['A. ', 'H. ', 'Tamm', 'saare']",0,,sg n,H
oli,oli,olema,ole,['ole'],i,,s,V
eesti,eesti,eesti,eesti,['eesti'],0,,,G
kirjanik,kirjanik,kirjanik,kirjanik,['kirjanik'],0,,sg n,S
",",",",",",",","[',']",,,,Z
esseist,esseist,essee,essee,['essee'],ist,,pl el,S
",",",",",",",","[',']",,,,Z
kultuurifilosoof,kultuurifilosoof,kultuurifilosoof,kultuuri_filosoof,"['kultuuri', 'filosoof']",0,,sg n,S
ja,ja,ja,ja,['ja'],0,,,J
tõlkija,tõlkija,tõlkija,tõlkija,['tõlkija'],0,,sg n,S


In [304]:
predictions, raw_outputs = model.predict(["Uuringutulemuste interpretatsioon jätab mitmed küsimused vastuseta.", "Kas tuba on soe või külm?"])

# Output of predictions
print(predictions)

INFO:ner_model.py:1884:  Converting to features started.
[[{'Uuringutulemuste': 'pl g_S'}, {'interpretatsioon': 'sg n_S'}, {'jätab': 'b_V'}, {'mitmed': 'pl n_P'}, {'küsimused': 'pl n_S'}, {'vastuseta.': 'sg ab_S'}], [{'Kas': 'D'}, {'tuba': 'sg n_S'}, {'on': 'b_V'}, {'soe': 'sg n_A'}, {'või': 'J'}, {'külm?': 'sg n_A'}]]
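
In the output above, `vastuseta.` and `külm?` are tagged as single tokens. This call passes raw strings without `split_on_space=False`, so each sentence is presumably split on whitespace and punctuation stays attached to the preceding word. The effect can be reproduced without the model; the regular expression below is only a stand-in tokenizer, not EstNLTK's:

```python
import re

sentence = "Kas tuba on soe või külm?"

# Whitespace splitting keeps punctuation glued to the preceding word
whitespace_tokens = sentence.split()

# Stand-in tokenizer: runs of word characters, or single punctuation marks
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)  # ['Kas', 'tuba', 'on', 'soe', 'või', 'külm?']
print(regex_tokens)       # ['Kas', 'tuba', 'on', 'soe', 'või', 'külm', '?']
```

Passing pre-tokenized word lists, as in the earlier cells, avoids the mismatch with the tokenization used in training.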


In [203]:
test_data = comparison_data[(comparison_data['source'] == 'nc_8303_518571.json') & (comparison_data['sentence_id'] == 797)]

In [204]:
text_list = test_data.words.tolist()

In [205]:
test_data

Unnamed: 0,sentence_id,words,form,pos,type,source,labels
1467181,797,Tegevusalad,pl n,S,periodicals,nc_8303_518571.json,pl n_S
1467182,797,:,,Z,periodicals,nc_8303_518571.json,Z
1467183,797,puit-,sg n,S,periodicals,nc_8303_518571.json,sg n_S
1467184,797,",",,Z,periodicals,nc_8303_518571.json,Z
1467185,797,plast-,sg n,S,periodicals,nc_8303_518571.json,sg n_S
...,...,...,...,...,...,...,...
1467249,797,",",,Z,periodicals,nc_8303_518571.json,Z
1467250,797,paljundustööd,pl n,S,periodicals,nc_8303_518571.json,pl n_S
1467251,797,",",,Z,periodicals,nc_8303_518571.json,Z
1467252,797,infoteenused,pl n,S,periodicals,nc_8303_518571.json,pl n_S


In [206]:
inputs = model.tokenizer(' '.join(text_list), return_tensors="pt")

In [207]:
len(inputs["input_ids"][0])

128

In [213]:
for token_idx in range(len(inputs["input_ids"][0])):
    token = model.tokenizer.decode([inputs["input_ids"][0][token_idx]])
    print(f"Token: {token}")

Token: <s>
Token: Tegevus
Token: alad
Token: :
Token: puit
Token: -
Token: ,
Token: plast
Token: -
Token: ,
Token: klaas
Token: -
Token: ,
Token: kummi
Token: -
Token: ja
Token: metall
Token: it
Token: oodete
Token: tootmine
Token: ,
Token: toidu
Token: kaupade
Token: tootmine
Token: ,
Token: toidu
Token: kaupade
Token: jae
Token: -
Token: ja
Token: hul
Token: gika
Token: ub
Token: andus
Token: (
Token: v
Token: .
Token: a
Token: .
Token: litse
Token: ntse
Token: eri
Token: tavad
Token: kaubad
Token: )
Token: ,
Token: tööstus
Token: kaupade
Token: jae
Token: -
Token: ,
Token: hulgi
Token: -
Token: ja
Token: komisjon
Token: ika
Token: ub
Token: andus
Token: ,
Token: arvuti
Token: -
Token: ,
Token: video
Token: -
Token: ja
Token: hel
Token: isal
Token: vest
Token: iste
Token: palju
Token: ndamine
Token: ja
Token: laen
Token: utamine
Token: ,
Token: interj
Token: öör
Token: iku
Token: ju
Token: ndus
Token: ,
Token: kuju
Token: ndust
Token: ööd
Token: ,
Token: demonst
Token: ratsioonide
To

In [208]:
text = estnltk.Text('  '.join(text_list))
text.tag_layer('clauses')
display(text)
clauses = [clause.text for clause in text.clauses]
for clause in clauses:
    print(len(clause))
print("Sum:",sum(len(clause) for clause in clauses))

text
"Tegevusalad : puit- , plast- , klaas- , kummi- ja metallitoodete tootmine , toidukaupade tootmine , toidukaupade jae- ja hulgikaubandus ( v . a . litsentseeritavad kaubad ) , tööstuskaupade jae- , hulgi- ja komisjonikaubandus , arvuti- , video- ja helisalvestiste paljundamine ja laenutamine , interjöörikujundus , kujundustööd , demonstratsioonide , näituste , nõupidamiste , konverentside ja spordivõistluste organiseerimine , ärikontaktide vahendamine , transporditeenused ( Eesti Vabariigi piires ) , teeninduspunktid , paljundustööd , infoteenused ."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,85
compound_tokens,"type, normalized",,tokens,False,10
words,normalized_form,,,True,73
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,73
clauses,clause_type,,words,False,4


2
61
5
5
Sum: 73


In [209]:
for i in range(len(text_list)):
    print(text_list[i], text.words[i].text)

Tegevusalad Tegevusalad
: :
puit- puit-
, ,
plast- plast-
, ,
klaas- klaas-
, ,
kummi- kummi-
ja ja
metallitoodete metallitoodete
tootmine tootmine
, ,
toidukaupade toidukaupade
tootmine tootmine
, ,
toidukaupade toidukaupade
jae- jae-
ja ja
hulgikaubandus hulgikaubandus
( (
v . a . v . a .
litsentseeritavad litsentseeritavad
kaubad kaubad
) )
, ,
tööstuskaupade tööstuskaupade
jae- jae-
, ,
hulgi- hulgi-
ja ja
komisjonikaubandus komisjonikaubandus
, ,
arvuti- arvuti-
, ,
video- video-
ja ja
helisalvestiste helisalvestiste
paljundamine paljundamine
ja ja
laenutamine laenutamine
, ,
interjöörikujundus interjöörikujundus
, ,
kujundustööd kujundustööd
, ,
demonstratsioonide demonstratsioonide
, ,
näituste näituste
, ,
nõupidamiste nõupidamiste
, ,
konverentside konverentside
ja ja
spordivõistluste spordivõistluste
organiseerimine organiseerimine
, ,
ärikontaktide ärikontaktide
vahendamine vahendamine
, ,
transporditeenused transporditeenused
( (
Eesti Eesti
Vabariigi Vabariigi
piires piire

In [210]:
predictions, raw_outputs = model.predict(clauses, split_on_space=False)
ner_labels_parts = list()
for prediction_part in predictions:
    ner_labels_part = [list(p.values())[0] for p in prediction_part]
    print(len(ner_labels_part), ner_labels_part)
    ner_labels_parts.append(ner_labels_part)

ner_labels = list(itertools.chain.from_iterable(ner_labels_parts))

2 ['pl n_S', 'Z']
61 ['sg n_S', 'Z', 'sg n_S', 'Z', 'sg n_S', 'Z', 'sg g_S', 'J', 'pl g_S', 'sg n_S', 'Z', 'pl g_S', 'sg n_S', 'Z', 'pl g_S', 'sg g_S', 'J', 'sg n_S', 'Z', 'pl g_S', 'sg g_S', 'Z', 'sg n_S', 'J', 'sg n_S', 'Z', 'sg g_S', 'Z', 'sg g_S', 'J', 'pl g_S', 'sg n_S', 'J', 'sg n_S', 'Z', 'sg n_S', 'Z', 'sg n_S', 'Z', 'pl g_S', 'Z', 'pl g_S', 'Z', 'pl g_S', 'Z', 'pl g_S', 'J', 'pl g_S', 'sg n_S', 'Z', 'pl g_S', 'sg n_S', 'Z', 'pl n_S', 'Z', 'pl n_S', 'Z', 'sg p_S', 'Z', 'pl n_S', 'Z']
5 ['Z', '?_Y', 'pl n_A', 'pl n_S', 'Z']
5 ['Z', 'sg g_H', 'sg g_S', 'K', 'Z']


In [217]:
predictions, raw_outputs = model.predict([text.words.text], split_on_space=False)

# ner_words = [list(p.keys())[0] for p in predictions[0]]
ner_labels = [list(p.values())[0] for p in predictions[0]]

In [218]:
len(ner_labels)

72
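
Getting 72 labels for 73 words, together with the 128 subword tokens counted earlier, is consistent with the input being cut off at a maximum sequence length (an assumption; the limit is not set explicitly in this notebook). One way to stay under such a limit is to pack consecutive clauses into chunks. In the sketch below, `count_subwords` is a hypothetical callable standing in for the model tokenizer, and the budget is illustrative:

```python
from typing import Callable, List


def pack_clauses(
    clauses: List[List[str]],
    count_subwords: Callable[[List[str]], int],
    budget: int,
) -> List[List[str]]:
    """Greedily merge consecutive clauses into chunks that stay within budget.

    A single clause longer than the budget is kept as its own chunk.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for clause in clauses:
        if current and count_subwords(current + clause) > budget:
            chunks.append(current)
            current = []
        current = current + clause
    if current:
        chunks.append(current)
    return chunks


# Toy stand-in: pretend every word costs two subword tokens
def fake_count(words: List[str]) -> int:
    return 2 * len(words)


clauses = [["Tegevusalad", ":"], ["puit-", ",", "plast-"], ["(", "v . a .", ")"], ["infoteenused", "."]]
chunks = pack_clauses(clauses, fake_count, budget=10)
print([len(chunk) for chunk in chunks])  # [5, 5]
```

The predicted labels of the chunks can then be concatenated in order, as the `itertools.chain` cell above does for clauses.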

In [174]:
predictions

[[{'Sisaldab': 'b_V'}, {':': 'Z'}],
 [{'dementsus': 'sg n_S'}, {':': 'Z'}],
 [{'-': 'Z'},
  {'CO-mürgistusest': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'tserebraalsest': 'sg el_A'},
  {'lipidoosist': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'epilepsiast': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'hepatolentikulaarsest': 'sg el_A'},
  {'degeneratsioonist': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'hüperkaltseemiast': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'kilpnäärmehaigustest': 'pl el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'intoksikatsioonidest': 'pl el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'sclerosis': 'sg n_S'},
  {"multiplex ' ist": 'sg n_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'neurosüüfilisest': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'niatsiinivaegusest': 'sg el_S'},
  {';': 'Z'},
  {'-': 'Z'},
  {'polyarteritis': 'sg in_S'},
  {"nodosa ' st": 'sg n_S'},
  {'-': 'Z'},
  {'süsteemsest': 'sg el_A'},
  {'lupus': 'sg n_S'},
  {'erythematosus': 'sg n_S'},
  {"'": 'Z'},
  {'est'

In [173]:
len(ner_labels)

132

In [22]:
text = estnltk.Text("Paki tagaküljele - \" Suitsetamine lühendab eluiga , \" \" Suitsetamine ummistab veresooni ning põhjustab südameinfarkte ja rabandusi , \" \" Suitsetamine tekitab surmavat kopsuvähki , \" \" Raseduse ajal suitsetamine kahjustab sinu last , \" \" Kaitse lapsi : ära sunni neid hingama tubakasuitsu , \" \" Arstilt või apteekrilt saad abi suitsetamisest loobumiseks , \" \" Suitsetamine tekitab kergesti sõltuvust , ära alusta , \" \" Suitsetamisest loobumine vähendab ohtu haigestuda surmavatesse südame- ja kopsuhaigustesse , \" \" Otsi abi suitsetamisest loobumiseks : küsi nõu oma perearstilt või lähimast apteegist . \"")
# text = estnltk.Text("Mul on taskus kolm asja: telefon, rahakott ja võtmed.")
# text = estnltk.Text("Ta ei söö herneid, sest need ei maitse talle.")
text.tag_layer('clauses')
clauses = [clause.text for clause in text.clauses]

In [81]:
text.clauses

layer name,attributes,parent,enveloping,ambiguous,span count
clauses,clause_type,,words,False,3

text,clause_type
"['231', 'Kirikuseadus', 'jaguneb', 'kahekümne', 'kaheksaks', 'peatükiks', ':']",regular
"['1', ')', 'Õigest', 'kristlikust', 'õpetusest', ',', '2', ')', 'Jutlustest', 'j ..., type: , length: 179",regular
"['(', 'Kuuenädalaste', ')']",embedded


In [68]:
clauses_text = [' '.join(clause) for clause in clauses]
full_text = ' '.join(clauses_text)
diff_text = estnltk.Text(full_text)
diff_text.tag_layer('clauses')
diff_clauses = [clause.text for clause in diff_text.clauses]

NameError: name 'clauses' is not defined

In [34]:
diff_text.clauses

layer name,attributes,parent,enveloping,ambiguous,span count
clauses,clause_type,,words,False,13

text,clause_type
"['Paki', 'tagaküljele', '-', '""', 'Suitsetamine', 'lühendab', 'eluiga', ',']",regular
"['""', '""', 'Suitsetamine', 'ummistab', 'veresooni', 'ning']",regular
"['põhjustab', 'südameinfarkte', 'ja', 'rabandusi', ',']",regular
"['""', '""', 'Suitsetamine', 'tekitab', 'surmavat', 'kopsuvähki', ',']",regular
"['""', '""', 'Raseduse', 'ajal', 'suitsetamine', 'kahjustab', 'sinu', 'last', ',']",regular
"['""', '""', 'Kaitse', 'lapsi', ':']",regular
"['ära', 'sunni', 'neid', 'hingama', 'tubakasuitsu', ',']",regular
"['""', '""', 'Arstilt', 'või', 'apteekrilt', 'saad', 'abi', 'suitsetamisest', 'loobumiseks', ',']",regular
"['""', '""', 'Suitsetamine', 'tekitab', 'kergesti', 'sõltuvust', ',']",regular
"['ära', 'alusta', ',']",regular


In [None]:
for clause in text.clauses:
    display(clause.morph_analysis)

### END

<a id='end'></a>