janes-ner

NER system for South Slavic languages

$ python tagger.py sl < example_sl.txt
Slovenija	Npfsn	B-loc
je	Va-r3s-n	O
zelo	Rgp	O
# kot Hrvaška #	Z Rgp Npfsn Z	O O B-loc O
lepa	Agpfsn	O
.	Z	O

$ python tagger.py hr < example_hr.txt
Dodali	Vmp-pm	O
smo	Var1p	O
i	Qo	O
preostale	Agpmpay	O
jezike	Ncmpa	O
.	Z	O

Marko	Npmsn	B-per
i	Cc	O
Ana	Npfsn	B-per
rade	Vmr3p	O
u	Sl	O
Microsoftu	Npmsl	B-org
u	Sl	O
Jajcu	Ncnsl	B-loc
.	Z	O
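The tagger's output above is tab-separated, one token per line, with the named-entity label in IOB encoding: B- opens an entity, I- continues it, and O marks tokens outside any entity. A minimal, illustrative sketch (assuming this three-column format; extract_entities is a hypothetical helper, not part of the tool) that groups tagged tokens into entity spans:

```python
def extract_entities(lines):
    """Group (token, MSD, NER) rows into (entity_text, label) spans via IOB tags."""
    entities = []
    current_tokens, current_label = [], None
    for line in lines:
        token, _msd, ner = line.split("\t")
        if ner.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], ner[2:]
        elif ner.startswith("I-") and current_tokens:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) ends any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_label))
    return entities

# Rows copied from the Croatian example above.
output = [
    "Marko\tNpmsn\tB-per",
    "i\tCc\tO",
    "Ana\tNpfsn\tB-per",
    "rade\tVmr3p\tO",
    "u\tSl\tO",
    "Microsoftu\tNpmsl\tB-org",
    "u\tSl\tO",
    "Jajcu\tNcnsl\tB-loc",
    ".\tZ\tO",
]
print(extract_entities(output))
# [('Marko', 'per'), ('Ana', 'per'), ('Microsoftu', 'org'), ('Jajcu', 'loc')]
```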

Necessary preprocessing

To produce data that is tokenised and part-of-speech-tagged (a prerequisite for named entity recognition), apply the following tools to running text: the ReLDI tokeniser (reldi-tokeniser) and the ReLDI tagger (reldi-tagger).

An exemplary run of these processes as a pipeline:

$ echo 'U Piranu pada kiša.' | python reldi-tokeniser/tokeniser.py hr | python reldi-tagger/tagger.py hr | python janes-ner/tagger.py -i 2 -m 3 hr
1.1.1.1-1	U	Sl	O
1.1.2.3-8	Piranu	Npmsl	B-loc
1.1.3.10-13	pada	Vmr3s	O
1.1.4.15-18	kiša	Ncfsn	O
1.1.5.19-19	.	Z	O
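When run on tokeniser output, as in the pipeline above, each row carries an identifier column. Its layout appears to be paragraph.sentence.token.start-end, where start and end are character offsets; this reading is an assumption inferred from the example, not documented behaviour. A small sketch parsing it:

```python
def parse_token_id(tid):
    # Assumed layout: paragraph.sentence.token.start-end (character offsets).
    para, sent, tok, span = tid.split(".")
    start, end = span.split("-")
    return int(para), int(sent), int(tok), int(start), int(end)

print(parse_token_id("1.1.2.3-8"))
# (1, 1, 2, 3, 8)  -> paragraph 1, sentence 1, token 2, characters 3-8
```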

Evaluation

The tagger was evaluated on different flavours of Slovene held-out data: standard data, non-standard data, and a mixture of the two.

The evaluation results on the standard data are the following:

             precision    recall  f1-score   support

          o       0.99      1.00      0.99     36938
  deriv-per       0.44      0.56      0.49        27
        loc       0.85      0.74      0.79       582
       misc       0.39      0.24      0.30       315
        org       0.69      0.48      0.57       497
        per       0.87      0.95      0.91       819

avg / total       0.98      0.98      0.98     39178

The evaluation results on the non-standard data are these:

             precision    recall  f1-score   support

          o       0.99      1.00      1.00      1740
  deriv-per       0.00      0.00      0.00         1
        loc       0.79      0.92      0.85        12
       misc       0.75      0.21      0.33        14
        org       0.50      0.33      0.40         6
        per       0.98      1.00      0.99        82

avg / total       0.99      0.99      0.99      1855

The evaluation results on the mixture of standard and non-standard data are these:

             precision    recall  f1-score   support

          o       0.99      1.00      0.99     40418
  deriv-per       0.44      0.52      0.48        29
        loc       0.85      0.75      0.80       606
       misc       0.41      0.24      0.30       343
        org       0.69      0.48      0.56       509
        per       0.88      0.96      0.92       983

avg / total       0.98      0.98      0.98     42888
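The avg / total rows in the tables above are support-weighted averages of the per-class scores. As a sanity check, recomputing the weighted F1 for the standard data from the rounded per-class figures lands close to the reported 0.98 (the small gap presumably comes from the report averaging unrounded per-class scores):

```python
# Per-class (f1, support) pairs copied from the standard-data table.
scores = {
    "o": (0.99, 36938),
    "deriv-per": (0.49, 27),
    "loc": (0.79, 582),
    "misc": (0.30, 315),
    "org": (0.57, 497),
    "per": (0.91, 819),
}

total = sum(n for _, n in scores.values())          # 39178, matching the table
weighted_f1 = sum(f1 * n for f1, n in scores.values()) / total
print(round(weighted_f1, 2))
```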

Citing the tagger

If you use the tagger, please cite the following paper:

@Article{Fišer2018,
  author  = "Fi{\v{s}}er, Darja and Ljube{\v{s}}i{\'{c}}, Nikola and Erjavec, Toma{\v{z}}",
  title   = "The Janes project: language resources and tools for Slovene user generated content",
  journal = "Language Resources and Evaluation",
  year    = "2018",
  issn    = "1574-0218",
  doi     = "10.1007/s10579-018-9425-z",
  url     = "https://doi.org/10.1007/s10579-018-9425-z"
}