
Converting from conllu to json via cli vs. biluo_tags_from_offsets #5740

Closed
sbushmanov opened this issue Jul 9, 2020 · 5 comments

sbushmanov commented Jul 9, 2020

How to reproduce the behaviour

I'm trying to convert from the conllu format to JSON so that I can use the JSON-formatted data in CLI training.

The following conversion scenario leads to an error message during training:

spacy convert nerus_lenta.conllu ./nerus_json -l ru

An excerpt from the resulting JSON:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"\u0412\u0438\u0446\u0435-\u043f\u0440\u0435\u043c\u044c\u0435\u0440",
                "tag":"NOUN",
                "head":6,
                "dep":"nsubj"
              },
              {
                "id":1,
                "orth":"\u043f\u043e",
                "tag":"ADP",
                "head":2,
                "dep":"case"
              },
              {
                "id":2,
                "orth":"\u0441\u043e\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u043c",
                "tag":"ADJ",
                "head":1,
                "dep":"amod"
              },
              {
                "id":3,
                "orth":"\u0432\u043e\u043f\u0440\u043e\u0441\u0430\u043c",
                "tag":"NOUN",
                "head":-3,
                "dep":"nmod"
              },
              {
                "id":4,
                "orth":"\u0422\u0430\u0442\u044c\u044f\u043d\u0430",
                "tag":"PROPN",
                "head":-4,
                "dep":"appos"
              },
              {
                "id":5,
                "orth":"\u0413\u043e\u043b\u0438\u043a\u043e\u0432\u0430",
                "tag":"PROPN",
                "head":-1,
                "dep":"flat:name"
              },
              {
                "id":6,
                "orth":"\u0440\u0430\u0441\u0441\u043a\u0430\u0437\u0430\u043b\u0430",
                "tag":"VERB",
                "head":0,
                "dep":"ROOT"
              },
              {
                "id":7,
                "orth":",",
                "tag":"PUNCT",
                "head":5,
                "dep":"punct"
              },
              {
                "id":8,
                "orth":"\u0432",
                "tag":"ADP",
                "head":2,
                "dep":"case"
              },
              {
                "id":9,
                "orth":"\u043a\u0430\u043a\u0438\u0445",
                "tag":"DET",
                "head":1,
                "dep":"det"
              },
              {
                "id":10,
                "orth":"\u0440\u0435\u0433\u0438\u043e\u043d\u0430\u0445",
                "tag":"NOUN",
                "head":2,
                "dep":"obl"
              },
              {
                "id":11,
                "orth":"\u0420\u043e\u0441\u0441\u0438\u0438",
                "tag":"PROPN",
                "head":-1,
                "dep":"nmod"
              },
              {
                "id":12,
                "orth":"\u0437\u0430\u0444\u0438\u043a\u0441\u0438\u0440\u043e\u0432\u0430\u043d\u0430",
                "tag":"VERB",
                "head":-6,
                "dep":"ccomp"
              },
              {
                "id":13,
                "orth":"\u043d\u0430\u0438\u0431\u043e\u043b\u0435\u0435",
                "tag":"ADV",
                "head":1,
                "dep":"advmod"
              },
              {
                "id":14,
                "orth":"\u0432\u044b\u0441\u043e\u043a\u0430\u044f",
                "tag":"ADJ",
                "head":1,
                "dep":"amod"
              },
              {
                "id":15,
                "orth":"\u0441\u043c\u0435\u0440\u0442\u043d\u043e\u0441\u0442\u044c",
                "tag":"NOUN",
                "head":-3,
                "dep":"nsubj:pass"
              },
              {
                "id":16,
                "orth":"\u043e\u0442",
                "tag":"ADP",
                "head":1,
                "dep":"case"
              },
              {
                "id":17,
                "orth":"\u0440\u0430\u043a\u0430",
                "tag":"NOUN",
                "head":-2,
                "dep":"nmod"
              },
              {
                "id":18,
                "orth":",",
                "tag":"PUNCT",
                "head":1,
                "dep":"punct"
              },
              {
                "id":19,
                "orth":"\u0441\u043e\u043e\u0431\u0449\u0430\u0435\u0442",
                "tag":"VERB",
                "head":0,
                "dep":"ROOT"
              },
              {
                "id":20,
                "orth":"\u0420\u0418\u0410",
                "tag":"PROPN",
                "head":-1,
                "dep":"nsubj"
              },
              {
                "id":21,
                "orth":"\u041d\u043e\u0432\u043e\u0441\u0442\u0438",
                "tag":"PROPN",
                "head":-1,
                "dep":"appos"
              },
              {
                "id":22,
                "orth":".",
                "tag":"PUNCT",
                "head":-3,
                "dep":"punct"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":1,

The error message:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 248, in train
    for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/util.py", line 535, in minibatch_by_words
    doc, gold = next(items)
  File "gold.pyx", line 217, in train_docs
  File "gold.pyx", line 233, in iter_gold_docs
  File "gold.pyx", line 253, in spacy.gold.GoldCorpus._make_golds
  File "gold.pyx", line 443, in spacy.gold.GoldParse.from_annot_tuples
  File "gold.pyx", line 593, in spacy.gold.GoldParse.__init__
ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {3, 5}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 431, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

However, when I convert from spaCy's "simple" format to JSON via:

from spacy.gold import biluo_tags_from_offsets, spans_from_biluo_tags

# nlp is a loaded Russian pipeline; data_train_raw is a list of (text, {"entities": [...]}) tuples
data_train_json = []
for text, annot in data_train_raw:
    try:
        # parse the text, then align the character-offset entities to the tokenization
        doc = nlp(text)
        tags = biluo_tags_from_offsets(doc, annot['entities'])
        entities = spans_from_biluo_tags(doc, tags)
        doc.ents = entities
        data_train_json.append(doc)
    except Exception:
        # skip examples whose entity offsets can't be aligned
        continue
everything, including training, goes fine.
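
For reference, here is a minimal sketch (not part of the original workflow, and assuming spaCy 2.2+, where spacy.gold.docs_to_json accepts a list of Docs) of how the collected docs could then be serialized into the CLI training format:

import srsly
from spacy.gold import docs_to_json

# docs_to_json wraps the docs into a single training "document" with paragraphs/sentences/tokens
json_data = [docs_to_json(data_train_json, id=0)]
srsly.write_json("train.json", json_data)  # hypothetical output path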

The resulting JSON looks different (note the raw and ner fields), but it works:

[
  {
    "id":0,
    "paragraphs":[
      {
        "raw":"\u041e \u043f\u043e\u0434\u043e\u0440\u043e\u0436\u0430\u043d\u0438\u0438 \u043c\u044f\u0441\u0430 \u0437\u0430\u044f\u0432\u0438\u043b\u0438 72 \u043f\u0440\u043e\u0446\u0435\u043d\u0442\u0430 \u0440\u043e\u0441\u0441\u0438\u044f\u043d, \u0430 \u043d\u0430 \u043c\u043e\u043b\u043e\u0447\u043d\u044b\u0435 \u043f\u0440\u043e\u0434\u0443\u043a\u0442\u044b - 68 \u043f\u0440\u043e\u0446\u0435\u043d\u0442\u043e\u0432.",
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"\u041e",
                "tag":"ADP___",
                "head":1,
                "dep":"case",
                "ner":"O"
              },
              {
                "id":1,
                "orth":"\u043f\u043e\u0434\u043e\u0440\u043e\u0436\u0430\u043d\u0438\u0438",
                "tag":"NOUN__Animacy=Inan|Case=Loc|Gender=Neut|Number=Sing",
                "head":2,
                "dep":"obl",
                "ner":"O"
              },
              {
                "id":2,
                "orth":"\u043c\u044f\u0441\u0430",
                "tag":"NOUN__Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing",
                "head":-1,
                "dep":"nmod",
                "ner":"O"
              },
              {
                "id":3,
                "orth":"\u0437\u0430\u044f\u0432\u0438\u043b\u0438",
                "tag":"VERB__Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act",
                "head":0,
                "dep":"ROOT",
                "ner":"O"
              },

Question:
Am I doing the conversion through the CLI incorrectly, and can it be fixed?

Your Environment

  • Operating System: Ubuntu 18.04
  • Python Version Used: 3.7.6
  • spaCy Version Used: 2.1.9
  • Environment Information:
@adrianeboyd
Contributor

The two conversions are doing very different things. The first one is converting tags and dependency parses from conllu to json (note that there's no NER info found by the converter -- it's likely the spacy 2.1.9 converter doesn't handle your particular conllu+NER format) and the second one is using automatically tagged and parsed docs with entities added from your annotation.
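
For example, here's a quick check (just a sketch; the output filename below is hypothetical, whatever spacy convert wrote into ./nerus_json) of whether any ner fields made it into the converted JSON:

import srsly

docs = srsly.read_json("nerus_json/nerus_lenta.json")  # hypothetical output path
tokens = docs[0]["paragraphs"][0]["sentences"][0]["tokens"]
print("ner" in tokens[0])  # False in the excerpt above, i.e. no NER info was converted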

Which model(s) are you trying to train? Just an NER model? Also a tagger and a parser? Can you show a sample of your conllu data (at least one full sentence, anonymized if needed)?

adrianeboyd added the feat / cli, training, and more-info-needed labels on Jul 10, 2020

sbushmanov commented Jul 10, 2020

An excerpt from the source conllu file (the misbehaving sentence). Note that NER tag info (e.g. Tag=B-LOC) is present in the source file:

# sent_id = 0_2
# text = Вице-премьер напомнила, что главные факторы смертности в России — рак и болезни системы кровообращения.
1	Вице-премьер	_	NOUN	_	Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	2	nsubj	_	Tag=O
2	напомнила	_	VERB	_	Aspect=Perf|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act	0	root	_	Tag=O
3	,	_	PUNCT	_	_	11	punct	_	Tag=O
4	что	_	SCONJ	_	_	11	mark	_	Tag=O
5	главные	_	ADJ	_	Case=Nom|Degree=Pos|Number=Plur	6	amod	_	Tag=O
6	факторы	_	NOUN	_	Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur	11	nsubj	_	Tag=O
7	смертности	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	6	nmod	_	Tag=O
8	в	_	ADP	_	_	9	case	_	Tag=O
9	России	_	PROPN	_	Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing	7	nmod	_	Tag=B-LOC
10	—	_	PUNCT	_	_	11	punct	_	Tag=O
11	рак	_	NOUN	_	Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing	6	nsubj	_	Tag=O
12	и	_	CCONJ	_	_	13	cc	_	Tag=O
13	болезни	_	NOUN	_	Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur	11	conj	_	Tag=O
14	системы	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	13	nmod	_	Tag=O
15	кровообращения	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing	14	nmod	_	Tag=O
16	.	_	PUNCT	_	_	2	punct	_	Tag=O

The corresponding generated JSON (with no NER tags, and containing the cycle):

  {
    "id": 2,
    "paragraphs": [
      {
        "sentences": [
          {
            "tokens": [
              {
                "id": 0,
                "orth": "Вице-премьер",
                "tag": "NOUN",
                "head": 1,
                "dep": "nsubj"
              },
              {
                "id": 1,
                "orth": "напомнила",
                "tag": "VERB",
                "head": 0,
                "dep": "ROOT"
              },
              {
                "id": 2,
                "orth": ",",
                "tag": "PUNCT",
                "head": 8,
                "dep": "punct"
              },
              {
                "id": 3,
                "orth": "что",
                "tag": "SCONJ",
                "head": 7,
                "dep": "mark"
              },
              {
                "id": 4,
                "orth": "главные",
                "tag": "ADJ",
                "head": 1,
                "dep": "amod"
              },
              {
                "id": 5,
                "orth": "факторы",
                "tag": "NOUN",
                "head": 5,
                "dep": "nsubj"
              },
              {
                "id": 6,
                "orth": "смертности",
                "tag": "NOUN",
                "head": -1,
                "dep": "nmod"
              },
              {
                "id": 7,
                "orth": "в",
                "tag": "ADP",
                "head": 1,
                "dep": "case"
              },
              {
                "id": 8,
                "orth": "России",
                "tag": "PROPN",
                "head": -2,
                "dep": "nmod"
              },
              {
                "id": 9,
                "orth": "—",
                "tag": "PUNCT",
                "head": 1,
                "dep": "punct"
              },
              {
                "id": 10,
                "orth": "рак",
                "tag": "NOUN",
                "head": -5,
                "dep": "nsubj"
              },
              {
                "id": 11,
                "orth": "и",
                "tag": "CCONJ",
                "head": 1,
                "dep": "cc"
              },
              {
                "id": 12,
                "orth": "болезни",
                "tag": "NOUN",
                "head": -2,
                "dep": "conj"
              },
              {
                "id": 13,
                "orth": "системы",
                "tag": "NOUN",
                "head": -1,
                "dep": "nmod"
              },
              {
                "id": 14,
                "orth": "кровообращения",
                "tag": "NOUN",
                "head": -1,
                "dep": "nmod"
              },
              {
                "id": 15,
                "orth": ".",
                "tag": "PUNCT",
                "head": -14,
                "dep": "punct"
              }
            ]
          }
        ]
      }
    ]
  },
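
As a sanity check (an illustrative snippet, not from the original report), you can walk the relative head offsets in the converted JSON and print the token ids that sit on a cycle; for this sentence it flags ids 5 and 10 ('факторы' and 'рак'), which is the cycle behind the training error below:

import json

def head_cycles(tokens):
    # "head" is stored as an offset relative to the token's own id
    heads = {t["id"]: t["id"] + t["head"] for t in tokens}
    on_cycle = set()
    for start in heads:
        seen, cur = [], start
        while heads[cur] != cur and cur not in seen:
            seen.append(cur)
            cur = heads[cur]
        if heads[cur] != cur:
            # the walk stopped on a repeated id: that id and everything visited after it forms a cycle
            on_cycle.update(seen[seen.index(cur):])
    return on_cycle

with open("nerus_json/nerus_lenta.json") as f:  # hypothetical path to the converted file
    data = json.load(f)

for doc in data:
    for para in doc["paragraphs"]:
        for sent in para["sentences"]:
            cycle = head_cycles(sent["tokens"])
            if cycle:
                print(doc["id"], sorted(cycle))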

The training error message:

ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {2, 10, 5} (tokens: ',' 'рак' 'факторы') in the document starting with tokens: Вице-премьер напомнила , что главные факторы смертности в России — рак и болезни системы кровообращения ..

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 570, in train
    exits=1,
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/wasabi/printer.py", line 90, in warn
    title, text, style=MESSAGES.WARN, show=show, spaced=spaced, exits=exits
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/wasabi/printer.py", line 236, in _get_msg
    title, text, color=style, icon=style, show=show, spaced=spaced, exits=exits
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/wasabi/printer.py", line 144, in text
    sys.exit(exits)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 615, in train
    best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 674, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

Neither spaCy 2.3.1 nor 2.1.9 is able to convert to JSON in a way that a model can train on the resulting files.

I'm training just the NER component.

The no-response bot removed the more-info-needed label on Jul 10, 2020

sbushmanov commented Jul 11, 2020

I tried a different Russian corpus from here, but I still get error messages during training:

spacy convert ru_syntagrus-ud-dev_w_xml.conllu ./ -c ner

spacy train ru /home/sergey/Py_Spacy_RU/test  /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-train_w_xml.json /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-test_w_xml.json --base-model /home/sergey/Py_Spacy_RU/ru2 --n-iter 20 --n-early-stopping 5 --pipeline 'ner'

Training pipeline: ['ner']
Starting with base model '/home/sergey/Py_Spacy_RU/ru2'
Counting training words (limit=0)

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
✔ Saved model to output directory                                                                                                      
/home/sergey/Py_Spacy_RU/test/model-final
⠙ Creating best model...
Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/language.py", line 475, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 414, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 517, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 106, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 165, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'U-' in the NER model."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 431, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

Am I missing something, or is there a bug?

@adrianeboyd
Contributor

I'll close this issue since I think the underlying question is addressed in #5753.


github-actions bot commented Nov 3, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
