
Converting from conllu to json via cli vs. biluo_tags_from_offsets #5740

Closed
sbushmanov opened this issue Jul 9, 2020 · 5 comments

sbushmanov commented Jul 9, 2020

How to reproduce the behaviour

I'm trying to convert from the conllu format to JSON so that I can use the JSON-formatted data in CLI training.

The following conversion scenario leads to an error message during training:

spacy convert nerus_lenta.conllu ./nerus_json -l ru

An excerpt from the resulting JSON:

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"\u0412\u0438\u0446\u0435-\u043f\u0440\u0435\u043c\u044c\u0435\u0440",
                "tag":"NOUN",
                "head":6,
                "dep":"nsubj"
              },
              {
                "id":1,
                "orth":"\u043f\u043e",
                "tag":"ADP",
                "head":2,
                "dep":"case"
              },
              {
                "id":2,
                "orth":"\u0441\u043e\u0446\u0438\u0430\u043b\u044c\u043d\u044b\u043c",
                "tag":"ADJ",
                "head":1,
                "dep":"amod"
              },
              {
                "id":3,
                "orth":"\u0432\u043e\u043f\u0440\u043e\u0441\u0430\u043c",
                "tag":"NOUN",
                "head":-3,
                "dep":"nmod"
              },
              {
                "id":4,
                "orth":"\u0422\u0430\u0442\u044c\u044f\u043d\u0430",
                "tag":"PROPN",
                "head":-4,
                "dep":"appos"
              },
              {
                "id":5,
                "orth":"\u0413\u043e\u043b\u0438\u043a\u043e\u0432\u0430",
                "tag":"PROPN",
                "head":-1,
                "dep":"flat:name"
              },
              {
                "id":6,
                "orth":"\u0440\u0430\u0441\u0441\u043a\u0430\u0437\u0430\u043b\u0430",
                "tag":"VERB",
                "head":0,
                "dep":"ROOT"
              },
              {
                "id":7,
                "orth":",",
                "tag":"PUNCT",
                "head":5,
                "dep":"punct"
              },
              {
                "id":8,
                "orth":"\u0432",
                "tag":"ADP",
                "head":2,
                "dep":"case"
              },
              {
                "id":9,
                "orth":"\u043a\u0430\u043a\u0438\u0445",
                "tag":"DET",
                "head":1,
                "dep":"det"
              },
              {
                "id":10,
                "orth":"\u0440\u0435\u0433\u0438\u043e\u043d\u0430\u0445",
                "tag":"NOUN",
                "head":2,
                "dep":"obl"
              },
              {
                "id":11,
                "orth":"\u0420\u043e\u0441\u0441\u0438\u0438",
                "tag":"PROPN",
                "head":-1,
                "dep":"nmod"
              },
              {
                "id":12,
                "orth":"\u0437\u0430\u0444\u0438\u043a\u0441\u0438\u0440\u043e\u0432\u0430\u043d\u0430",
                "tag":"VERB",
                "head":-6,
                "dep":"ccomp"
              },
              {
                "id":13,
                "orth":"\u043d\u0430\u0438\u0431\u043e\u043b\u0435\u0435",
                "tag":"ADV",
                "head":1,
                "dep":"advmod"
              },
              {
                "id":14,
                "orth":"\u0432\u044b\u0441\u043e\u043a\u0430\u044f",
                "tag":"ADJ",
                "head":1,
                "dep":"amod"
              },
              {
                "id":15,
                "orth":"\u0441\u043c\u0435\u0440\u0442\u043d\u043e\u0441\u0442\u044c",
                "tag":"NOUN",
                "head":-3,
                "dep":"nsubj:pass"
              },
              {
                "id":16,
                "orth":"\u043e\u0442",
                "tag":"ADP",
                "head":1,
                "dep":"case"
              },
              {
                "id":17,
                "orth":"\u0440\u0430\u043a\u0430",
                "tag":"NOUN",
                "head":-2,
                "dep":"nmod"
              },
              {
                "id":18,
                "orth":",",
                "tag":"PUNCT",
                "head":1,
                "dep":"punct"
              },
              {
                "id":19,
                "orth":"\u0441\u043e\u043e\u0431\u0449\u0430\u0435\u0442",
                "tag":"VERB",
                "head":0,
                "dep":"ROOT"
              },
              {
                "id":20,
                "orth":"\u0420\u0418\u0410",
                "tag":"PROPN",
                "head":-1,
                "dep":"nsubj"
              },
              {
                "id":21,
                "orth":"\u041d\u043e\u0432\u043e\u0441\u0442\u0438",
                "tag":"PROPN",
                "head":-1,
                "dep":"appos"
              },
              {
                "id":22,
                "orth":".",
                "tag":"PUNCT",
                "head":-3,
                "dep":"punct"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":1,

The error message:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 248, in train
    for batch in util.minibatch_by_words(train_docs, size=batch_sizes):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/util.py", line 535, in minibatch_by_words
    doc, gold = next(items)
  File "gold.pyx", line 217, in train_docs
  File "gold.pyx", line 233, in iter_gold_docs
  File "gold.pyx", line 253, in spacy.gold.GoldCorpus._make_golds
  File "gold.pyx", line 443, in spacy.gold.GoldParse.from_annot_tuples
  File "gold.pyx", line 593, in spacy.gold.GoldParse.__init__
ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {3, 5}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 431, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

However, when I convert from spaCy's "simple" format to JSON via:

from spacy.gold import biluo_tags_from_offsets, spans_from_biluo_tags

# nlp is a loaded Russian pipeline; data_train_raw is a list of (text, {"entities": [...]}) tuples
data_train_json = []
for text, annot in data_train_raw:
    try:
        # parse the text, then align the character-offset entities to the tokenization
        doc = nlp(text)
        tags = biluo_tags_from_offsets(doc, annot['entities'])
        entities = spans_from_biluo_tags(doc, tags)
        doc.ents = entities
        data_train_json.append(doc)
    except Exception:
        # skip examples whose entity offsets can't be aligned
        continue
everything, including training, goes fine.
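
For reference, here is a minimal sketch (not part of the original workflow, and assuming spaCy 2.2+, where spacy.gold.docs_to_json accepts a list of Docs) of how the collected docs could then be serialized into the CLI training format:

import srsly
from spacy.gold import docs_to_json

# docs_to_json wraps the docs into a single training "document" with paragraphs/sentences/tokens
json_data = [docs_to_json(data_train_json, id=0)]
srsly.write_json("train.json", json_data)  # hypothetical output path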

The resulting JSON looks different (note the raw and ner fields), but it works:

[
  {
    "id":0,
    "paragraphs":[
      {
        "raw":"\u041e \u043f\u043e\u0434\u043e\u0440\u043e\u0436\u0430\u043d\u0438\u0438 \u043c\u044f\u0441\u0430 \u0437\u0430\u044f\u0432\u0438\u043b\u0438 72 \u043f\u0440\u043e\u0446\u0435\u043d\u0442\u0430 \u0440\u043e\u0441\u0441\u0438\u044f\u043d, \u0430 \u043d\u0430 \u043c\u043e\u043b\u043e\u0447\u043d\u044b\u0435 \u043f\u0440\u043e\u0434\u0443\u043a\u0442\u044b - 68 \u043f\u0440\u043e\u0446\u0435\u043d\u0442\u043e\u0432.",
        "sentences":[
          {
            "tokens":[
              {
                "id":0,
                "orth":"\u041e",
                "tag":"ADP___",
                "head":1,
                "dep":"case",
                "ner":"O"
              },
              {
                "id":1,
                "orth":"\u043f\u043e\u0434\u043e\u0440\u043e\u0436\u0430\u043d\u0438\u0438",
                "tag":"NOUN__Animacy=Inan|Case=Loc|Gender=Neut|Number=Sing",
                "head":2,
                "dep":"obl",
                "ner":"O"
              },
              {
                "id":2,
                "orth":"\u043c\u044f\u0441\u0430",
                "tag":"NOUN__Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing",
                "head":-1,
                "dep":"nmod",
                "ner":"O"
              },
              {
                "id":3,
                "orth":"\u0437\u0430\u044f\u0432\u0438\u043b\u0438",
                "tag":"VERB__Aspect=Perf|Mood=Ind|Number=Plur|Tense=Past|VerbForm=Fin|Voice=Act",
                "head":0,
                "dep":"ROOT",
                "ner":"O"
              },

Question:
Am I doing the conversion through the CLI incorrectly, and can it be fixed?

Your Environment

  • Operating System: Ubuntu 18.04
  • Python Version Used: 3.7.6
  • spaCy Version Used: 2.1.9
  • Environment Information:
@adrianeboyd
Contributor

The two conversions are doing very different things. The first one is converting tags and dependency parses from conllu to json (note that there's no NER info found by the converter -- it's likely the spacy 2.1.9 converter doesn't handle your particular conllu+NER format) and the second one is using automatically tagged and parsed docs with entities added from your annotation.
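
For example, here's a quick check (just a sketch; the output filename below is hypothetical, whatever spacy convert wrote into ./nerus_json) of whether any ner fields made it into the converted JSON:

import srsly

docs = srsly.read_json("nerus_json/nerus_lenta.json")  # hypothetical output path
tokens = docs[0]["paragraphs"][0]["sentences"][0]["tokens"]
print("ner" in tokens[0])  # False in the excerpt above, i.e. no NER info was converted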

Which model(s) are you trying to train? Just an NER model? Also a tagger and a parser? Can you show a sample of your conllu data (at least one full sentence, anonymized if needed)?

adrianeboyd added the feat / cli, training, and more-info-needed labels on Jul 10, 2020

sbushmanov commented Jul 10, 2020

An excerpt from the source conllu file (the misbehaving sentence). Note that NER tag info (e.g. Tag=B-LOC) is present in the source file:

# sent_id = 0_2
# text = Вице-премьер напомнила, что главные факторы смертности в России — рак и болезни системы кровообращения.
1	Вице-премьер	_	NOUN	_	Animacy=Anim|Case=Nom|Gender=Masc|Number=Sing	2	nsubj	_	Tag=O
2	напомнила	_	VERB	_	Aspect=Perf|Gender=Fem|Mood=Ind|Number=Sing|Tense=Past|VerbForm=Fin|Voice=Act	0	root	_	Tag=O
3	,	_	PUNCT	_	_	11	punct	_	Tag=O
4	что	_	SCONJ	_	_	11	mark	_	Tag=O
5	главные	_	ADJ	_	Case=Nom|Degree=Pos|Number=Plur	6	amod	_	Tag=O
6	факторы	_	NOUN	_	Animacy=Inan|Case=Nom|Gender=Masc|Number=Plur	11	nsubj	_	Tag=O
7	смертности	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	6	nmod	_	Tag=O
8	в	_	ADP	_	_	9	case	_	Tag=O
9	России	_	PROPN	_	Animacy=Inan|Case=Loc|Gender=Fem|Number=Sing	7	nmod	_	Tag=B-LOC
10	—	_	PUNCT	_	_	11	punct	_	Tag=O
11	рак	_	NOUN	_	Animacy=Inan|Case=Nom|Gender=Masc|Number=Sing	6	nsubj	_	Tag=O
12	и	_	CCONJ	_	_	13	cc	_	Tag=O
13	болезни	_	NOUN	_	Animacy=Inan|Case=Nom|Gender=Fem|Number=Plur	11	conj	_	Tag=O
14	системы	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Fem|Number=Sing	13	nmod	_	Tag=O
15	кровообращения	_	NOUN	_	Animacy=Inan|Case=Gen|Gender=Neut|Number=Sing	14	nmod	_	Tag=O
16	.	_	PUNCT	_	_	2	punct	_	Tag=O

The corresponding generated JSON (with no NER tags, and containing the cycle):

  {
    "id": 2,
    "paragraphs": [
      {
        "sentences": [
          {
            "tokens": [
              {
                "id": 0,
                "orth": "Вице-премьер",
                "tag": "NOUN",
                "head": 1,
                "dep": "nsubj"
              },
              {
                "id": 1,
                "orth": "напомнила",
                "tag": "VERB",
                "head": 0,
                "dep": "ROOT"
              },
              {
                "id": 2,
                "orth": ",",
                "tag": "PUNCT",
                "head": 8,
                "dep": "punct"
              },
              {
                "id": 3,
                "orth": "что",
                "tag": "SCONJ",
                "head": 7,
                "dep": "mark"
              },
              {
                "id": 4,
                "orth": "главные",
                "tag": "ADJ",
                "head": 1,
                "dep": "amod"
              },
              {
                "id": 5,
                "orth": "факторы",
                "tag": "NOUN",
                "head": 5,
                "dep": "nsubj"
              },
              {
                "id": 6,
                "orth": "смертности",
                "tag": "NOUN",
                "head": -1,
                "dep": "nmod"
              },
              {
                "id": 7,
                "orth": "в",
                "tag": "ADP",
                "head": 1,
                "dep": "case"
              },
              {
                "id": 8,
                "orth": "России",
                "tag": "PROPN",
                "head": -2,
                "dep": "nmod"
              },
              {
                "id": 9,
                "orth": "—",
                "tag": "PUNCT",
                "head": 1,
                "dep": "punct"
              },
              {
                "id": 10,
                "orth": "рак",
                "tag": "NOUN",
                "head": -5,
                "dep": "nsubj"
              },
              {
                "id": 11,
                "orth": "и",
                "tag": "CCONJ",
                "head": 1,
                "dep": "cc"
              },
              {
                "id": 12,
                "orth": "болезни",
                "tag": "NOUN",
                "head": -2,
                "dep": "conj"
              },
              {
                "id": 13,
                "orth": "системы",
                "tag": "NOUN",
                "head": -1,
                "dep": "nmod"
              },
              {
                "id": 14,
                "orth": "кровообращения",
                "tag": "NOUN",
                "head": -1,
                "dep": "nmod"
              },
              {
                "id": 15,
                "orth": ".",
                "tag": "PUNCT",
                "head": -14,
                "dep": "punct"
              }
            ]
          }
        ]
      }
    ]
  },
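
As a sanity check (an illustrative snippet, not from the original report), you can walk the relative head offsets in the converted JSON and print the token ids that sit on a cycle; for this sentence it flags ids 5 and 10 ('факторы' and 'рак'), which is the cycle behind the training error below:

import json

def head_cycles(tokens):
    # "head" is stored as an offset relative to the token's own id
    heads = {t["id"]: t["id"] + t["head"] for t in tokens}
    on_cycle = set()
    for start in heads:
        seen, cur = [], start
        while heads[cur] != cur and cur not in seen:
            seen.append(cur)
            cur = heads[cur]
        if heads[cur] != cur:
            # the walk stopped on a repeated id: that id and everything visited after it forms a cycle
            on_cycle.update(seen[seen.index(cur):])
    return on_cycle

with open("nerus_json/nerus_lenta.json") as f:  # hypothetical path to the converted file
    data = json.load(f)

for doc in data:
    for para in doc["paragraphs"]:
        for sent in para["sentences"]:
            cycle = head_cycles(sent["tokens"])
            if cycle:
                print(doc["id"], sorted(cycle))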

The training error message:

ValueError: [E069] Invalid gold-standard parse tree. Found cycle between word IDs: {2, 10, 5} (tokens: ',' 'рак' 'факторы') in the document starting with tokens: Вице-премьер напомнила , что главные факторы смертности в России — рак и болезни системы кровообращения ..

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 570, in train
    exits=1,
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/wasabi/printer.py", line 90, in warn
    title, text, style=MESSAGES.WARN, show=show, spaced=spaced, exits=exits
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/wasabi/printer.py", line 236, in _get_msg
    title, text, color=style, icon=style, show=show, spaced=spaced, exits=exits
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/wasabi/printer.py", line 144, in text
    sys.exit(exits)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 615, in train
    best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 674, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

Neither spaCy 2.3.1 nor 2.1.9 is able to convert to JSON in a way that a model can train on the resulting files.

I'm training just the NER component.

The no-response bot removed the more-info-needed label on Jul 10, 2020

sbushmanov commented Jul 11, 2020

I tried a different Russian corpus from here, but I still get error messages during training:

spacy convert ru_syntagrus-ud-dev_w_xml.conllu ./ -c ner

spacy train ru /home/sergey/Py_Spacy_RU/test  /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-train_w_xml.json /home/sergey/Py_Spacy_RU/data/syntagrus/ru_syntagrus-ud-test_w_xml.json --base-model /home/sergey/Py_Spacy_RU/ru2 --n-iter 20 --n-early-stopping 5 --pipeline 'ner'

Training pipeline: ['ner']
Starting with base model '/home/sergey/Py_Spacy_RU/ru2'
Counting training words (limit=0)

Itn    Dep Loss    NER Loss      UAS    NER P    NER R    NER F    Tag %  Token %  CPU WPS  GPU WPS
---  ----------  ----------  -------  -------  -------  -------  -------  -------  -------  -------
✔ Saved model to output directory                                                                                                      
/home/sergey/Py_Spacy_RU/test/model-final
⠙ Creating best model...
Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 257, in train
    losses=losses,
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/language.py", line 475, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 414, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 517, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 106, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 165, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'U-' in the NER model."

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/__main__.py", line 35, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 328, in call
    cmd, result = parser.consume(arglist)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/plac_core.py", line 207, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 368, in train
    best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
  File "/home/sergey/anaconda3/envs/rasa/lib/python3.7/site-packages/spacy/cli/train.py", line 431, in _collate_best_model
    path2str(best_component_src / component), path2str(best_dest / component)
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'

Am I missing something, or is there a bug?

@adrianeboyd
Contributor

I'll close this issue since I think the underlying question is addressed in #5753.


github-actions bot commented Nov 3, 2021

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
