In [1]:
# Download the model before running this notebook:
# python -m spacy download xx_ent_wiki_sm

In [2]:
import os
import spacy

In [3]:
file_dir = '../data/ru/factRuEval-2016/devset'
devset = 'devset_combined.txt'
dev_path = os.path.join(file_dir, devset)

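# The text is Russian, so set the encoding explicitly instead of relying on the platform default
with open(os.path.join(file_dir, 'book_1667.txt'), encoding='utf-8') as f:
    test_example = f.read()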

In [4]:
nlp = spacy.load('xx_ent_wiki_sm')
doc = nlp(test_example)

In [5]:
spacy.displacy.render(doc, style='ent', jupyter=True)

In [6]:
# Print the gold span annotations for the same document
with open(os.path.join(file_dir, 'book_1667.spans'), encoding='utf-8') as f:
    print('Actual Tags:\n')
    print(f.read())

Actual Tags:

28823 loc_name 15 9 826611 1  # 826611 Пхеньяном
28824 loc_name 122 4 826624 1  # 826624 КНДР
70437 geo_adj 45 15 826614 1  # 826614 северокорейский
70438 job 61 10 826615 1  # 826615 перебежчик
70439 job 45 26 826614 2  # 826614 826615 северокорейский перебежчик
70352 org_name 146 29 826627 3  # 826627 826628 826629 Верховного народного собрания
70353 org_descr 167 8 826629 1  # 826629 собрания
70354 loc_name 176 4 826630 1  # 826630 КНДР
70355 surname 181 4 826631 1  # 826631 Хван
70356 name 186 4 826632 1  # 826632 Чжан
70357 name 191 2 826633 1  # 826633 Еп
70358 loc_name 351 8 826658 1  # 826658 Пхеньяна
70359 surname 435 4 826674 1  # 826674 Кима
70360 name 440 3 826675 1  # 826675 Чен
70361 name 444 3 826676 1  # 826676 Ира
70362 loc_name 448 4 826677 1  # 826677 КНДР
70363 loc_name 574 9 826705 1  # 826705 Пхеньяном
70364 loc_name 627 4 826712 1  # 826712 КНДР
70365 loc_name 679 5 826718 1  # 826718 Китая
70366 loc_name 760 11 826732 1  # 826732 Поднебесная
70367 

### Summary
This dataset is from factRuEval-2016. Its labels are somewhat unusual for NER: for example, "job" is included as a category, and "person" is split into name, surname, and nickname. These are the dataset's "span" labels rather than its "object" labels. Because the annotators marked up "objects" (the direct analogues of entities) after the spans, the indices where each object is located were not preserved. As a result, objects cannot be linked accurately and straightforwardly to their locations in the text, and the span labels are the next best thing available.
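
Judging from the output printed above, each line of the `.spans` file appears to hold a span id, a label, a character offset, a length in characters, the first token id, and a token count, followed by `#`, the token ids, and the token text. This layout is inferred from the printed output rather than from the dataset's documentation, so the field names below are assumptions. A minimal sketch of a parser under that assumption (`SpanRecord` and `parse_spans` are hypothetical names):

```python
from typing import List, NamedTuple


class SpanRecord(NamedTuple):
    span_id: int
    label: str
    start: int            # assumed: character offset into the .txt file
    length: int           # assumed: length in characters
    token_ids: List[int]
    text: str

    @property
    def end(self) -> int:
        return self.start + self.length


def parse_spans(raw: str) -> List[SpanRecord]:
    """Parse the contents of a .spans file, skipping incomplete lines."""
    records = []
    for line in raw.splitlines():
        head, sep, tail = line.partition('#')
        fields = head.split()
        # Expect six fields before '#'; anything else (blank or cut-off lines) is skipped.
        if not sep or len(fields) != 6:
            continue
        span_id, label, start, length, _first_token, n_tokens = fields
        n = int(n_tokens)
        parts = tail.split()
        records.append(SpanRecord(
            span_id=int(span_id),
            label=label,
            start=int(start),
            length=int(length),
            token_ids=[int(t) for t in parts[:n]],
            text=' '.join(parts[n:]),
        ))
    return records
```

If the layout assumption holds, `test_example[r.start:r.end]` should reproduce `r.text` for each record, which gives a quick sanity check on the offsets.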

This is not a problem in itself, since the span labels are arguably more granular and informative than typical NER markup. It only complicates assessing spaCy's out-of-the-box performance, because the pretrained model cannot predict these labels.
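
One rough way around the label mismatch is to collapse the span labels onto the coarse labels that `xx_ent_wiki_sm` predicts (PER, LOC, ORG, MISC) and count a gold span as found when a predicted entity with the same coarse label overlaps it. The mapping below is an illustrative assumption, not part of factRuEval-2016 or spaCy, and labels with no obvious counterpart (`job`, `geo_adj`, `org_descr`) are simply dropped. `SPAN_TO_COARSE`, `coarse_gold`, and `overlap_recall` are hypothetical names.

```python
# Hypothetical mapping from factRuEval span labels to the model's coarse labels.
SPAN_TO_COARSE = {
    'loc_name': 'LOC',
    'org_name': 'ORG',
    'name': 'PER',
    'surname': 'PER',
    'nickname': 'PER',
}


def coarse_gold(spans):
    """Convert (label, start, length) triples to a set of (start, end, coarse_label).

    Spans whose label has no entry in SPAN_TO_COARSE are dropped.
    """
    return {
        (start, start + length, SPAN_TO_COARSE[label])
        for label, start, length in spans
        if label in SPAN_TO_COARSE
    }


def overlap_recall(gold, predicted):
    """Fraction of gold spans overlapped by a predicted span with the same label."""
    if not gold:
        return 0.0
    hits = sum(
        any(p_label == g_label and p_start < g_end and g_start < p_end
            for p_start, p_end, p_label in predicted)
        for g_start, g_end, g_label in gold
    )
    return hits / len(gold)
```

The predicted triples could be built from the notebook's `doc` as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. Overlap matching is deliberately lenient: a single predicted PER entity covering a full name counts as a hit for each of the separate `surname` and `name` spans, so the result is not comparable to the strict scores reported for the model.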

spaCy's reported NER scores for this model (F-score, precision, and recall):

NER F 79.88

NER P 80.27

NER R 79.49
