# Experiment with the homonym "muna"

## Table of contents
   1. [**Gathering data**](#andmete_kogumine)
   2. [**Model training**](#margendaja_treenimine)
   3. [**Tagger evaluation**](#margendaja_hindamine)
   4. [**Evaluation on UD treebank**](#margendaja_hindamine_puudepangal)
   5. [**Results**](#tulemused)


[end](#end)

In [1]:
import os
import json
import warnings
import evaluate
import pkg_resources
import types
import pandas as pd
import numpy as np
import estnltk, estnltk.converters, estnltk.taggers
import sklearn

from bert_morph_tagger_notebook_functions import NotebookFunctions
from simpletransformers.ner import NERModel, NERArgs
from tqdm import tqdm
from bert_morph_tagger import BertMorphTagger



In [2]:
# Get locally imported modules from current notebook - https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook - Alex P. Miller
def get_imports():
    
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]

        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
            
        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
            
        yield name
imports = list(set(get_imports()))

# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names 
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

for r in requirements:
    print("{}=={}".format(*r))

estnltk==1.7.3
evaluate==0.4.2
matplotlib==3.9.0
numpy==1.26.4
pandas==2.2.2
scikit-learn==1.5.1
simpletransformers==0.70.1
torch==2.5.1
tqdm==4.66.5
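`pkg_resources` is deprecated in recent setuptools releases. A minimal sketch of the same version listing with the standard-library `importlib.metadata` could look as follows; the helper name `installed_versions` and the package list are illustrative and not part of the original notebook.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if it is not installed."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Hypothetical list; in the notebook this would be the pip names collected by get_imports().
for name, version in installed_versions(["estnltk", "pandas", "scikit-learn"]).items():
    print(f"{name}=={version}" if version else f"{name} is not installed")
```

Unlike the cross-check against `pkg_resources.working_set`, this looks each name up directly, so a package that is missing shows up explicitly instead of being dropped from the list.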


<a id='andmete_kogumine'></a>

### Gathering data

In [14]:
lines = []

with open('./100_manually_corrected_muna_sentences.jl', 'r') as file:
    for line in file:
        lines.append(json.loads(line))

data = pd.DataFrame(lines)

In [15]:
display(data.head())

|   | corpus | doc_id | text | correct_forms |
|---|--------|--------|------|---------------|
| 0 | nc19_Balanced_Corpus.vert | 1751 | "Salmonella levib hästi piimatoitudes, kus on kasutatud toorest muna," nimetab tohter ühe ohuallikana paljude lemmikmaiust tiramisut. | [{'word': 'muna', 'form': 'S_sg p', 'start': 64, 'end': 68}] |
| 1 | nc19_Balanced_Corpus.vert | 2471 | (1994) Tshak-tshak Taignaks 400 g nisujahu, 160 g muna, 67 g piima, 10 g suhkrut, 22 g soola, 210 g sulavõid. | [{'word': 'muna', 'form': 'S_sg p', 'start': 50, 'end': 54}] |
| 2 | nc19_Balanced_Corpus.vert | 2917 | Geko igatseb kaasat Hallikirjut kummimänguasja meenutav tokee geko valvab kaht oma abikaasa munetud muna. | [{'word': 'muna', 'form': 'S_sg p', 'start': 100, 'end': 104}] |
| 3 | nc19_Balanced_Corpus.vert | 3301 | Nüüd tuleb lisada suhkur ja ükshaaval kuus muna. | [{'word': 'muna', 'form': 'S_sg p', 'start': 43, 'end': 47}] |
| 4 | nc19_Balanced_Corpus.vert | 4057 | Ülestõusmispühadeks pidi iga hellake muna keetma ja selle värvima, koos viidi siis need vanadekodusse. | [{'word': 'muna', 'form': 'S_sg g', 'start': 37, 'end': 41}] |
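Each entry in `correct_forms` stores character offsets into `text`, so the annotations can be sanity-checked before training. The sketch below is not part of the original notebook; `check_annotations` is a hypothetical helper that lists every annotation whose `[start, end)` slice does not equal the stored word.

```python
def check_annotations(text, correct_forms):
    """Return the annotations whose [start, end) slice of `text` differs from the stored word."""
    return [
        ann for ann in correct_forms
        if text[ann["start"]:ann["end"]] != ann["word"]
    ]


# Row 3 from the preview above.
text = "Nüüd tuleb lisada suhkur ja ükshaaval kuus muna."
correct_forms = [{"word": "muna", "form": "S_sg p", "start": 43, "end": 47}]

mismatches = check_annotations(text, correct_forms)
print(mismatches)  # an empty list means every offset points at its word
```

Running such a check over all rows would catch offsets that shifted during manual correction.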


<a id='margendaja_treenimine'></a>

### Model training

Creating the required dataset

In [3]:
if not os.path.exists("./muna_data.csv"):
    rows = []
    sentence_id = 0

    for _, row in data.iterrows():
        text_obj = estnltk.Text(row['text'])
        morph_analysis = text_obj.tag_layer('morph_analysis')
        i = 0  # index of the next manually corrected "muna" annotation in this text
        for sentence in morph_analysis.sentences:
            sentence_analysis = sentence.morph_analysis
            for text_part, form, pos in zip(sentence_analysis.text, sentence_analysis.form, sentence_analysis.partofspeech):
                if text_part == "muna":
                    # Use the manually corrected tag, stored as "<pos>_<form>" (e.g. "S_sg p")
                    pos_form = row['correct_forms'][i]['form'].split('_')
                    rows.append((sentence_id, text_part, pos_form[1], pos_form[0]))
                    i += 1
                    continue
                if text_part:
                    # For ambiguous analyses, keep the first one (index 0)
                    rows.append((sentence_id, text_part, form[0], pos[0]))
            sentence_id += 1

    train_df = pd.DataFrame(rows, columns=['sentence_id', 'words', 'form', 'pos'])
    NotebookFunctions.create_labels_column(train_df, "muna_data.csv")

else:
    train_df = pd.read_csv('muna_data.csv')
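The core of the loop above is an override: every token keeps the first analysis proposed by the analyser, except "muna", which takes the manually corrected tag. A minimal sketch of that logic without EstNLTK is given below; `build_rows` and the hand-written token analyses are illustrative assumptions, not output of the real analyser.

```python
def build_rows(sentence_id, analysed_tokens, corrected_tags, target="muna"):
    """Build (sentence_id, word, form, pos) rows, overriding the analysis of `target`.

    analysed_tokens: list of (word, forms, parts_of_speech) tuples
    corrected_tags:  manual tags for `target`, in order of occurrence, as "<pos>_<form>"
    """
    rows = []
    corrected = iter(corrected_tags)
    for word, forms, parts_of_speech in analysed_tokens:
        if word == target:
            pos, form = next(corrected).split("_")
        else:
            # For ambiguous analyses, keep the first one
            pos, form = parts_of_speech[0], forms[0]
        rows.append((sentence_id, word, form, pos))
    return rows


# Hand-written analyses for illustration only.
tokens = [
    ("kuus", ["sg n", "sg in"], ["N", "S"]),
    ("muna", ["sg n", "sg g", "sg p"], ["S", "S", "S"]),
    (".", [""], ["Z"]),
]
rows = build_rows(0, tokens, ["S_sg p"])
print(rows)
```

Only the "muna" row depends on the manual annotation; all other rows inherit whatever the analyser ranked first, which is why the training labels for non-"muna" tokens may contain analyser errors.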

Initializing the model

In [4]:
unique_labels = NotebookFunctions.get_unique_labels("./unique_labels.json")
model = NotebookFunctions.initialize_model("NER_mudel_v2", unique_labels=unique_labels)

Set the output directories for the new model and the training parameters

In [None]:
model.args.output_dir = 'NER_mudel_v2_muna_1'
model.args.cache_dir = os.path.join(model.args.output_dir, 'cache')
model.args.best_model_dir = os.path.join(model.args.output_dir, 'best_model')
model.args.num_train_epochs = 20
model.args.learning_rate = 1e-5

Train the model

In [None]:
if not os.path.exists(os.path.join('./' + model.args.output_dir, 'config.json')):
    # Train model
    print("Training model")
    model.train_model(train_df, output_dir=model.args.output_dir, only_muna="muna")
else:
    model = NotebookFunctions.initialize_model(model.args.output_dir, unique_labels)

Training model


100%|██████████| 100/100 [00:00<00:00, 4000.02it/s]
Epochs 1/20. Running Loss:    0.5540: 100%|██████████| 13/13 [00:01<00:00, 11.02it/s]
Epochs 2/20. Running Loss:    0.8486: 100%|██████████| 13/13 [00:00<00:00, 16.18it/s]
Epochs 3/20. Running Loss:    0.9083: 100%|██████████| 13/13 [00:00<00:00, 15.99it/s]
Epochs 4/20. Running Loss:    0.3594: 100%|██████████| 13/13 [00:00<00:00, 15.04it/s]
Epochs 5/20. Running Loss:    0.2994: 100%|██████████| 13/13 [00:00<00:00, 15.90it/s]
Epochs 6/20. Running Loss:    0.1518: 100%|██████████| 13/13 [00:00<00:00, 16.95it/s]
Epochs 7/20. Running Loss:    0.1544: 100%|██████████| 13/13 [00:00<00:00, 16.97it/s]
Epochs 8/20. Running Loss:    0.2860: 100%|██████████| 13/13 [00:00<00:00, 16.77it/s]
Epochs 9/20. Running Loss:    0.0619: 100%|██████████| 13/13 [00:00<00:00, 16.53it/s]
Epochs 10/20. Running Loss:    0.0250: 100%|██████████| 13/13 [00:00<00:00, 16.56it/s]
Epochs 11/20. Running Loss:    0.076

Set the output directories and the training parameters for the second model

In [None]:
model.args.output_dir = 'NER_mudel_v2_muna_2'
model.args.cache_dir = os.path.join(model.args.output_dir, 'cache')
model.args.best_model_dir = os.path.join(model.args.output_dir, 'best_model')
model.args.num_train_epochs = 20
model.args.learning_rate = 1e-5

Train the model

In [None]:
if not os.path.exists(os.path.join(model.args.output_dir, 'config.json')):
    # Train model
    print("Training model")
    model.train_model(train_df, output_dir=model.args.output_dir, only_muna="all")
else:
    model = NotebookFunctions.initialize_model(model.args.output_dir, unique_labels)

Training model


100%|██████████| 100/100 [00:00<00:00, 3333.55it/s]
Epochs 1/20. Running Loss:    0.0022: 100%|██████████| 13/13 [00:00<00:00, 16.08it/s]
Epochs 2/20. Running Loss:    0.0132: 100%|██████████| 13/13 [00:00<00:00, 16.31it/s]
Epochs 3/20. Running Loss:    0.0129: 100%|██████████| 13/13 [00:00<00:00, 15.53it/s]
Epochs 4/20. Running Loss:    0.0024: 100%|██████████| 13/13 [00:00<00:00, 15.34it/s]
Epochs 5/20. Running Loss:    0.0024: 100%|██████████| 13/13 [00:00<00:00, 16.05it/s]
Epochs 6/20. Running Loss:    0.0037: 100%|██████████| 13/13 [00:00<00:00, 15.24it/s]
Epochs 7/20. Running Loss:    0.0021: 100%|██████████| 13/13 [00:00<00:00, 16.18it/s]
Epochs 8/20. Running Loss:    0.0956: 100%|██████████| 13/13 [00:00<00:00, 17.14it/s]
Epochs 9/20. Running Loss:    0.0015: 100%|██████████| 13/13 [00:00<00:00, 16.60it/s]
Epochs 10/20. Running Loss:    0.0816: 100%|██████████| 13/13 [00:00<00:00, 16.98it/s]
Epochs 11/20. Running Loss:    0.0059: 100%|██████████| 13/13 [00:00<00:00, 16.72it/s]


<a id='margendaja_hindamine'></a>

### Tagger evaluation

In [5]:
models = ['./NER_mudel/', './NER_mudel_v2/', './NER_mudel_v2_muna_1/', './NER_mudel_v2_muna_2/']

In [6]:
warnings.filterwarnings('ignore', module='bert_tokens_to_words_rewriter')

In [10]:
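def get_results(data):
    # Uses the global `morph_tagger`; returns one boolean per "muna" annotation
    # that has a predicted span with exactly the same (start, end) offsets
    results = []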
    for _, d in data[['text', 'correct_forms']].iterrows():
        correct_forms = d['correct_forms']
        text_obj = estnltk.Text(d['text'])
        text_obj.tag_layer('sentences')
        morph_tagger.tag(text_obj)

        span_dict = {
            (span.start, span.end): span.partofspeech[0] + '_' + span.form[0]
            for span in text_obj.bert_morph_tagging.spans
        }

        for correct_form in correct_forms:
            key = (correct_form['start'], correct_form['end'])
            if key in span_dict:
                predicted_form = span_dict[key]
                is_same = predicted_form == correct_form['form']
                # print(f"Tagger: {predicted_form}, Correct form: {correct_form['form']}, Same: {is_same}")
                results.append(is_same)
    return results
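`get_results` pairs each manual annotation with the predicted span that has the same character offsets; an annotation without such a span is skipped and does not count as an error. The sketch below shows the same matching on plain tuples and additionally reports how many annotations were skipped. `match_by_offsets` and the sample spans are hypothetical.

```python
def match_by_offsets(predicted_spans, gold_annotations):
    """Compare gold tags with predictions that share the same (start, end) offsets.

    predicted_spans:  list of (start, end, tag) tuples
    gold_annotations: list of dicts with 'start', 'end' and 'form' keys
    Returns (booleans for the matched annotations, number of annotations without a span).
    """
    by_offsets = {(start, end): tag for start, end, tag in predicted_spans}
    results, unmatched = [], 0
    for gold in gold_annotations:
        key = (gold["start"], gold["end"])
        if key in by_offsets:
            results.append(by_offsets[key] == gold["form"])
        else:
            unmatched += 1
    return results, unmatched


# Made-up predictions for "... kuus muna." with one wrong tag on "muna".
predicted = [(38, 42, "N_sg n"), (43, 47, "S_sg g"), (47, 48, "Z_")]
gold = [{"word": "muna", "form": "S_sg p", "start": 43, "end": 47}]
results, unmatched = match_by_offsets(predicted, gold)
print(results, unmatched)
```

A non-zero `unmatched` count would indicate that the tagger tokenized the text differently from the manual annotation, which the accuracy figures alone cannot reveal.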

In [11]:
def evaluate_results(results):

    results_binary = np.array([int(is_correct) for is_correct in results])
    true_values = np.ones(len(results_binary), dtype=int)

    accuracy = sklearn.metrics.accuracy_score(results_binary, true_values)
    precision = sklearn.metrics.precision_score(results_binary, true_values, zero_division=0)
    # recall = sklearn.metrics.recall_score(results_binary, true_values, zero_division=0)
    f1 = sklearn.metrics.f1_score(results_binary, true_values, zero_division=0)

    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    # print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    unique, counts = np.unique(results_binary, return_counts=True)
    print(dict(zip(unique, counts)))

In [16]:
for model_path in models:
    morph_tagger = BertMorphTagger(model_path)
    print(f"Model_path: {model_path}")
    evaluate_results(get_results(data))

Model_path: ./NER_mudel/
Accuracy: 0.8350
Precision: 0.8350
F1 Score: 0.9101
{0: 17, 1: 86}
Model_path: ./NER_mudel_v2/
Accuracy: 0.9320
Precision: 0.9320
F1 Score: 0.9648
{0: 7, 1: 96}
Model_path: ./NER_mudel_v2_muna_1/
Accuracy: 0.9903
Precision: 0.9903
F1 Score: 0.9951
{0: 1, 1: 102}
Model_path: ./NER_mudel_v2_muna_2/
Accuracy: 0.9903
Precision: 0.9903
F1 Score: 0.9951
{0: 1, 1: 102}
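Because `evaluate_results` passes the per-token correctness flags as the first argument (`y_true`) and a vector of ones as the second (`y_pred`), the reported precision is identical to the accuracy, recall would always be 1, and the F1 score reduces to 2a / (1 + a) for accuracy a. The sketch below reproduces the first result block from its counts in plain Python; `binary_scores` is an illustrative helper, not part of the notebook.

```python
def binary_scores(n_correct, n_wrong):
    """Scores when y_true is the correctness flag and y_pred is all ones."""
    tp = n_correct  # flag 1, predicted 1
    fp = n_wrong    # flag 0, predicted 1
    fn = 0          # nothing is ever predicted 0
    accuracy = tp / (tp + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn) if tp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts printed for ./NER_mudel/: {0: 17, 1: 86}
scores = binary_scores(n_correct=86, n_wrong=17)
print({name: round(value, 4) for name, value in scores.items()})
```

In this setup accuracy is the only independent quantity; precision and F1 are functions of it and carry no additional information.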


<a id='margendaja_hindamine_puudepangal'></a>

### Evaluation on UD treebank

In [16]:
train_df_ud = pd.read_csv('UD_train.csv', keep_default_na=False)
test_df_ud = pd.read_csv('UD_test.csv', keep_default_na=False)
dev_df_ud = pd.read_csv('UD_dev.csv', keep_default_na=False)

Removing columns the model does not need

In [17]:
train_df_ud = train_df_ud.drop(labels=['file_prefix', 'source'], axis=1)
test_df_ud = test_df_ud.drop(labels=['file_prefix', 'source'], axis=1)
dev_df_ud = dev_df_ud.drop(labels=['file_prefix', 'source'], axis=1)

In [None]:
poseval = evaluate.load("evaluate-metric/poseval", module_type="metric")

def custom_metrics(preds, labels):

    # Evaluate using poseval
    result = poseval.compute(predictions=preds, references=labels)

    return result

model = NotebookFunctions.initialize_model("NER_mudel_v2_muna_1", unique_labels=unique_labels)

# Evaluate the model
result, model_outputs, preds_list = model.eval_model(test_df_ud, extra_metrics=custom_metrics)

print(f"Evaluation Loss:{result['eval_loss']:.4f}")
print(f"Precision: \t{result['extra_metrics']['weighted avg']['precision']:.4f}")
print(f"Recall: \t{result['extra_metrics']['weighted avg']['recall']:.4f}")
print(f"F1 Score: \t{result['extra_metrics']['weighted avg']['f1-score']:.4f}")

100%|██████████| 3/3 [00:09<00:00,  3.14s/it]
Running Evaluation: 100%|██████████| 12/12 [00:01<00:00,  6.76it/s]


Evaluation Loss:0.3237
Precision: 	0.9568
Recall: 	0.9466
F1 Score: 	0.9474


In [None]:
poseval = evaluate.load("evaluate-metric/poseval", module_type="metric")

def custom_metrics(preds, labels):

    # Evaluate using poseval
    result = poseval.compute(predictions=preds, references=labels)

    return result

model = NotebookFunctions.initialize_model("NER_mudel_v2_muna_2", unique_labels=unique_labels)

# Evaluate the model
result, model_outputs, preds_list = model.eval_model(test_df_ud, extra_metrics=custom_metrics)

print(f"Evaluation Loss:{result['eval_loss']:.4f}")
print(f"Precision: \t{result['extra_metrics']['weighted avg']['precision']:.4f}")
print(f"Recall: \t{result['extra_metrics']['weighted avg']['recall']:.4f}")
print(f"F1 Score: \t{result['extra_metrics']['weighted avg']['f1-score']:.4f}")

100%|██████████| 3/3 [00:09<00:00,  3.20s/it]
Running Evaluation: 100%|██████████| 12/12 [00:01<00:00,  6.73it/s]


Evaluation Loss:0.3896
Precision: 	0.9559
Recall: 	0.9455
F1 Score: 	0.9464


<a id='tulemused'></a>

### Results

Predicting the label of the word "muna"

| Model                    | Accuracy  | Precision | F1 score |
|--------------------------|-----------|-----------|----------|
| Vabamorf                 | 0.8350    | 0.8350    | 0.9101   |
| Bert_morph_v2            | 0.9320    | 0.9320    | 0.9648   |
| Bert_morph_v2_muna_1[^1] | 0.9903    | 0.9903    | 0.9951   |
| Bert_morph_v2_muna_2[^2] | 0.9903    | 0.9903    | 0.9951   |

Evaluation on the UD treebank

| Model                    | Precision | Recall | F1 score |
|--------------------------|-----------|--------|----------|
| Vabamorf                 | 0.9194    | 0.9067 | 0.9082   |
| Bert_morph_v2            | 0.9778    | 0.9765 | 0.9769   |
| Bert_morph_v2_muna_1[^1] | 0.9568    | 0.9466 | 0.9474   |
| Bert_morph_v2_muna_2[^2] | 0.9559    | 0.9455 | 0.9464   |

\* Precision, recall and F1 score are the weighted averages over all labels.
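The `weighted avg` entry read from the `poseval` result is, as in scikit-learn's classification report, the mean of the per-label scores weighted by how often each label occurs in the reference data (its support). A minimal sketch with made-up per-label numbers, assuming that definition:

```python
def weighted_average(per_label_scores, supports):
    """Support-weighted mean of per-label scores."""
    total = sum(supports.values())
    return sum(per_label_scores[label] * supports[label] for label in supports) / total


# Made-up per-label precision and support, for illustration only.
precision = {"S_sg n": 0.90, "S_sg g": 0.80, "S_sg p": 0.50}
support = {"S_sg n": 60, "S_sg g": 30, "S_sg p": 10}

avg_precision = weighted_average(precision, support)
print(round(avg_precision, 4))
```

Frequent labels dominate this average, so a drop on a rare label such as a specific case form of one word has little effect on the UD figures.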

[^1]: Bert_morph_v2 fine-tuned on the "muna" training set, with non-"muna" tokens completely ignored.
[^2]: Bert_morph_v2 fine-tuned on the "muna" training set, with the other tokens included in the loss at a smaller weight.
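The two fine-tuning variants differ only in how much the non-"muna" tokens contribute to the loss. The notebook does not show the implementation behind `only_muna`, so the following is only a sketch of one way to express this idea: a weighted mean of per-token cross-entropies, where the function names, the logits and the weight 0.1 are all assumptions.

```python
import math


def token_cross_entropy(logits, label):
    """Cross-entropy of one token: -log softmax(logits)[label]."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


def weighted_sentence_loss(tokens, logits, labels, target="muna", other_weight=0.0):
    """Weighted mean of token losses: weight 1 for `target`, `other_weight` for the rest."""
    weights = [1.0 if token == target else other_weight for token in tokens]
    losses = [token_cross_entropy(l, y) for l, y in zip(logits, labels)]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # sentence without the target word and with other tokens ignored
    return sum(w * loss for w, loss in zip(weights, losses)) / total_weight


# Made-up logits over three labels for a three-token sentence.
tokens = ["kuus", "muna", "."]
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0]]
labels = [0, 1, 2]

only_target = weighted_sentence_loss(tokens, logits, labels, other_weight=0.0)
with_context = weighted_sentence_loss(tokens, logits, labels, other_weight=0.1)
print(only_target, with_context)
```

With `other_weight=0.0` the loss depends on the "muna" token alone (variant 1); a small positive weight lets the remaining tokens act as a mild regularizer (variant 2).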

<a id='end'></a>