After Spacy 2.3 upgrade, command-line spacy train command keeps failing #5620

mbrunecky · 2020-06-20T22:55:15Z

How to reproduce the behaviour

I have been using Spacy 2.2.3 and the command below to train my models for almost a month. I decided to add GPU, and updated my Spacy installation using
pip install -D spacy[CUDA101]
This updated my install to Spacy 2.3. (I then repeatedly uninstalled and reinstalled with pip --no-cache-dir install -U spacy[cuda101], but that did not change the 'new' behavior.)
I updated Spacy models (see environment below) AND pip install -U spacy-lookups-data

My training command now fails:

py -m spacy train en C:\Work\ML\Spacy\dataset\model C:\Work\ML\Spacy\dataset\train C:\Work\ML\Spacy\dataset\valid -v en_core_web_md
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'en'
Loading vector from model 'en_core_web_md'
Traceback (most recent call last):
  File "C:\Program Files\Python\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Program Files\Python\lib\site-packages\spacy\__main__.py", line 33, in <module>
    plac.call(commands[command], sys.argv[1:])
  File "C:\Program Files\Python\lib\site-packages\plac_core.py", line 367, in call
    cmd, result = parser.consume(arglist)
  File "C:\Program Files\Python\lib\site-packages\plac_core.py", line 232, in consume
    return cmd, self.func(*(args + varargs + extraopts), **kwargs)
  File "C:\Program Files\Python\lib\site-packages\spacy\cli\train.py", line 266, in train
    _load_vectors(nlp, vectors)
  File "C:\Program Files\Python\lib\site-packages\spacy\cli\train.py", line 645, in _load_vectors
    util.load_model(vectors, vocab=nlp.vocab)
  File "C:\Program Files\Python\lib\site-packages\spacy\util.py", line 170, in load_model
    return load_model_from_package(name, **overrides)
  File "C:\Program Files\Python\lib\site-packages\spacy\util.py", line 191, in load_model_from_package
    return cls.load(**overrides)
  File "C:\Program Files\Python\lib\site-packages\en_core_web_md\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "C:\Program Files\Python\lib\site-packages\spacy\util.py", line 235, in load_model_from_init_py
    return load_model_from_path(data_path, meta, **overrides)
  File "C:\Program Files\Python\lib\site-packages\spacy\util.py", line 216, in load_model_from_path
    component = nlp.create_pipe(factory, config=config)
  File "C:\Program Files\Python\lib\site-packages\spacy\language.py", line 309, in create_pipe
    return factory(self, **config)
  File "C:\Program Files\Python\lib\site-packages\spacy\language.py", line 1080, in factory
    return obj.from_nlp(nlp, **cfg)
  File "pipes.pyx", line 62, in spacy.pipeline.pipes.Pipe.from_nlp
  File "pipes.pyx", line 378, in spacy.pipeline.pipes.Tagger.__init__
TypeError: __init__() got multiple values for keyword argument 'vocab'

When trying to train using an existing model (this never worked in 2.2.3), I get:

py -m spacy train en C:\Work\ML\Spacy\dataset\model C:\Work\ML\Spacy\dataset\train C:\Work\ML\Spacy\dataset\valid -m en_core_web_md

✘ Can't find model meta.json
en_core_web_md

(the file is there, in C:\Program Files\Python\Lib\site-packages\en_core_web_md\meta.json)

Your Environment

Operating System: Windows 10 (update 2004)
Python Version Used: Python 3.8.1
spaCy Version Used: 2.3.0 (fresh update from 2.2.3)
Environment Information:
Installation performed as Administrator, NO venv
py -m spacy validate
✔ Loaded compatibility table

====================== Installed models (spaCy v2.3.0) ======================
ℹ spaCy installation: C:\Program Files\Python\lib\site-packages\spacy

TYPE      NAME             MODEL            VERSION
package   en-core-web-sm   en_core_web_sm   2.3.0   ✔
package   en-core-web-md   en_core_web_md   2.3.0   ✔
package   en-core-web-lg   en_core_web_lg   2.3.0   ✔

mbrunecky · 2020-06-22T00:32:36Z

A minor correction:
My second complaint:
py -m spacy train en C:\Work\ML\Spacy\dataset\model C:\Work\ML\Spacy\dataset\train C:\Work\ML\Spacy\dataset\valid -m en_core_web_md
was INCORRECT.
I intended to use the 'base' model option, -b en_core_web_md (I got the option name confused). However, using -b en_core_web_md (still) ends in an error, which I thought had been fixed in 2.3 (per the error discussion).
I train for NER using two entity names (NAME_FROM, NAME_TO) and with -b en_core_web_md I (still) get:

Traceback (most recent call last):
  File "C:\Program Files\Python\lib\site-packages\spacy\cli\train.py", line 425, in train
    nlp.update(
  File "C:\Program Files\Python\lib\site-packages\spacy\language.py", line 526, in update
    proc.update(docs, golds, sgd=get_grads, losses=losses, **kwargs)
  File "nn_parser.pyx", line 446, in spacy.syntax.nn_parser.Parser.update
  File "nn_parser.pyx", line 548, in spacy.syntax.nn_parser.Parser._init_gold_batch
  File "ner.pyx", line 107, in spacy.syntax.ner.BiluoPushDown.preprocess_gold
  File "ner.pyx", line 165, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'B-NAME_FROM' in the NER model."

So the only option that seems to allow me to continue with Spacy using vectors is using one of my older (Spacy 2.2.3) models as 'base', and keep re-training them. However, the message Spacy is giving me is not very encouraging:
UserWarning: [W031] Model 'en_model' (0.0.0) requires spaCy v2.2 and is incompatible with the current spaCy version (2.3.0). This may lead to unexpected results or runtime errors. To resolve this, download a newer compatible model or retrain your custom model with the current spaCy version. For more details and available updates, run: python -m spacy validate

adrianeboyd · 2020-06-22T06:32:01Z

Thanks for the report, the train command error with -v does look like a bug. If you want to go back to v2.2 in the meanwhile, you can install v2.2 with cupy like this:

pip install spacy[cuda101]==2.2.4

The warning above is correct: v2.2 and v2.3 models aren't compatible, so you need to be sure the models you're using match the spacy version you're using.
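One way to check this before training is to compare the spaCy version a model was built for with the installed one. The following is only a minimal sketch: it assumes the model's meta.json carries a "spacy_version" specifier such as ">=2.3.0,<2.4.0" and looks only at its lower bound. The helper names are hypothetical and not part of spaCy; `python -m spacy validate` remains the authoritative check.

```python
import json
from pathlib import Path


def minor_series(version):
    """Return the (major, minor) pair of a version string like '2.3.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def model_matches_spacy(meta, spacy_version):
    """Check whether a model's meta targets the given spaCy minor series.

    Assumption: meta["spacy_version"] is a specifier such as
    ">=2.3.0,<2.4.0"; only the lower bound is inspected here.
    """
    requirement = meta.get("spacy_version", "")
    lower = requirement.split(",")[0].lstrip("<>=!~ ")
    if not lower:
        return False
    return minor_series(lower) == minor_series(spacy_version)


def load_meta(model_dir):
    """Read meta.json from a model directory."""
    with (Path(model_dir) / "meta.json").open(encoding="utf8") as f:
        return json.load(f)
```

With spaCy installed, this could be called as model_matches_spacy(load_meta(model_dir), spacy.__version__), where model_dir is the directory that holds the model's meta.json.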

adrianeboyd · 2020-06-22T07:07:50Z

This looks like an unintended side effect of #5374.
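For readers unfamiliar with this kind of TypeError, here is a small illustration of how it can arise and be avoided. This is not spaCy's code: the class and function names are hypothetical stand-ins, and the sketch only assumes that a component constructor receives the vocab explicitly while a config dict that also contains a "vocab" key is unpacked into the same call.

```python
class Component:
    """Stand-in for a pipeline component whose constructor takes a vocab."""

    def __init__(self, vocab, **cfg):
        self.vocab = vocab
        self.cfg = cfg


def create_component(vocab, config):
    # Fails if `config` also contains "vocab": the argument arrives twice.
    return Component(vocab, **config)


def create_component_skipping_vocab(vocab, config):
    # Drop the conflicting key before forwarding the remaining settings.
    filtered = {key: value for key, value in config.items() if key != "vocab"}
    return Component(vocab, **filtered)
```

Calling create_component with a config that includes "vocab" raises a TypeError about multiple values for that argument, while the filtering variant passes the remaining settings through unchanged.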

mbrunecky · 2020-06-22T21:12:15Z

Thank you for the suggestion to revert to Spacy 2.2.4:
pip install spacy[cuda101]==2.2.4
I just wonder if re-installing cuda will not 'break' it again, as after installing spacy[cuda101] I had to remove cuda and reinstall it (per another bug report).

That said, I would prefer NOT to have to revert to 2.2.4. I was able to re-train one of my 2.1.3 models (using it as 'base'). My hope was 2.3 fixed the memory leak in beam-search. And in 2.3 it seems better, though still leaking - see bug ... (I need NER prediction confidence, and I really can not re-load the model in prediction server every 500 or so requests).
However, with -v option broken, my only option in 2.3 is using -b (base) model, which has another problem: re-training can NOT introduce any new entity types; when I do so, I get the old error:

File "ner.pyx", line 165, in spacy.syntax.ner.BiluoPushDown.lookup_transition
KeyError: "[E022] Could not find a transition with the name 'B-NAME_FROM' in the NER model."

Since bug 3782 has been closed, should I submit it again? Or is it an enhancement request "please allow adding new entity(ies) in re-training a base model"?

adrianeboyd · 2020-06-23T12:09:08Z

You shouldn't have to modify your CUDA/cupy installation when upgrading spacy. It should be relatively independent. You can uninstall cupy-cuda101 and spacy will then just run on CPU with numpy, and then reinstall cupy-cuda101 and spacy will detect it again if you enable the GPU.

The beam search memory leak should be fixed in 2.2.4.

If you'd like to use 2.3.0, you can apply this very short patch and install spacy from source (or just edit this one file in your spacy install) to fix the -v option: https://github.com/explosion/spaCy/pull/5624/files

If you'd like to use the train CLI -b option with new labels, you can load the model, add the new labels with nlp.get_pipe("ner").add_label("LABEL"), save it with nlp.to_disk(), and then use that as the base model. It's hard for the train CLI to cover every possible training scenario (it's already too complicated to be honest), but you can make a copy of spacy/cli/train.py, adjust the imports, and use it as an independent script so it's easy to edit however you need.
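The label-adding steps described above could be sketched roughly as follows. The helper name add_ner_labels and the output directory are hypothetical; the calls on the nlp object (get_pipe, add_label, to_disk) are the ones named in the comment, and the usage lines assume spaCy v2.3 with en_core_web_md installed.

```python
def add_ner_labels(nlp, labels, output_dir):
    """Add entity labels to the model's NER component and save the model."""
    ner = nlp.get_pipe("ner")
    for label in labels:
        ner.add_label(label)
    # Write the modified pipeline to disk so it can serve as a base model.
    nlp.to_disk(output_dir)


# Intended usage (requires spaCy and the model package):
#
#   import spacy
#   nlp = spacy.load("en_core_web_md")
#   add_ner_labels(nlp, ["NAME_FROM", "NAME_TO"], "base_with_labels")
#
# The saved directory can then be passed to the train CLI with -b.
```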

mbrunecky · 2020-06-23T21:44:04Z

Thank you. I will try to 'patch' the 2.3 ... the fix seems pretty simple/obvious.
That way I can keep using the vectors from Spacy models such as en_core_web_md.

I got the GPU working, though I hoped for a better performance boost (then again, I did not spend that much on the GPU either). Perhaps I can improve it with some parameter tuning. Right now Spacy reports about 30,000 GPU WPS, but the GPU utilization is very low, about 20% CUDA use on my GeForce GTX 1660 card. Looks like I will get 2x the speed...

Regarding "The beam search memory leak should be fixed in 2.2.4":
Looking at my server, the process uses ~1.2 GB after loading the model, and after 4000 page predictions it grows to ~2.7 GB. That is better than in 2.2.3 but still unacceptable. Is there a bug # for this problem?

And thank you very much for the advice on adding the labels. CLI 'train' too complicated? I guess 'complicated' is a relative term. I may try it, though I do not trust my Python skills yet. Perhaps, almost 20 years ago, when my co-worker Mark Lutz was enthusiastically writing the first Python book, I should have joined him :-).

adrianeboyd · 2020-06-24T08:12:08Z

In terms of the memory usage, maybe #5083 is relevant? The vocab in the model is not static and the memory usage will grow to some extent as you use it on texts with tokens it hasn't seen before. If the memory usage doesn't look like it's explained by this, you can open a new issue with a minimal example that demonstrates the problem and we'll try to look into it. It could be related to the tee problem mentioned in #5083?
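A rough way to see whether vocab growth is a plausible explanation is to count the entries in the string store before and after processing a batch of texts. This is a sketch under the assumption that nlp is a loaded spaCy pipeline and that len(nlp.vocab.strings) reflects the number of stored strings; the helper name is hypothetical, and the count says nothing about memory held elsewhere.

```python
def string_store_growth(nlp, texts):
    """Return how many new strings the texts added to nlp.vocab.strings."""
    before = len(nlp.vocab.strings)
    for text in texts:
        nlp(text)  # processing a text may add unseen tokens to the store
    return len(nlp.vocab.strings) - before
```

If the count keeps climbing across batches of noisy text but stays flat when the same text is repeated, the growth is more likely tied to new tokens than to a leak.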

Most of the development has focused on efficient CPU implementations and the cupy/GPU implementation is not particularly efficient at this point. I think I normally see about a 2-4x difference between CPU and GPU, but it'll depend on your system. It's probably not going to be a huge difference.

With the train CLI there are so many possible combinations of options that it's hard to keep everything tested sufficiently. There are several configurations that we test thoroughly when training the provided models (we also use the train CLI directly for internal training), but some like this -v bug still end up falling through the cracks. The current train CLI is going to be replaced with training from config files in spacy v3, which should hopefully be easier to maintain.

mbrunecky · 2020-06-25T16:42:34Z

Thank you very much.

With regards to memory usage, #5083 is relevant. My data comes from OCR'd documents, and there is an endless supply of 'garbage' (noise, unreadable words) and 'misspellings' (words such as "inthe"), so my vocabulary is almost infinite. My idea is to replace the 'garbage' in the data with some special token, like '…', to indicate that there is unknown text, though I doubt I know enough about Spacy and ML to choose the 'right' approach. We also try 'fixing' the 'misspellings', but that has been an uphill battle (perhaps another AI/ML project ☺). Hence adding nlp.vocab.strings._reset_and_load(minimal_strings) makes sense, and I would expect it to be much faster than re-loading the entire model.

I am not familiar with the itertools.tee() problem, but the prediction code using beam search is NOT thread safe. Using it in a multi-threaded server is out of the question.

Further with regards to memory usage, I see a (smaller but steady) increase in memory utilization even when I repeatedly (1000 times) request a prediction for the same phrase (using beam widths up to 16). I will try to reproduce it in a small unit test and submit it as a bug.

Finally, thanks again for the -v patch. Works like a champ. And switching from command-line options to config file(s) makes a lot of sense. That could include the environment variables. I have learned to use Java Properties (key=value pairs, i.e. a persistent dictionary) a LOT…
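The placeholder idea mentioned above (replacing OCR noise with a token such as '…' before prediction) could be prototyped along these lines. The heuristic is purely hypothetical and deliberately crude: it flags tokens that mix letters with digits or symbols, or that contain no vowels. It would also flag dates and hyphenated words, and run-together words such as "inthe" pass through untouched, so it is a starting point for experiments rather than a recommended preprocessing step.

```python
PLACEHOLDER = "…"
EDGE_PUNCTUATION = ".,;:!?()\"'"
VOWELS = set("aeiouyAEIOUY")


def looks_like_noise(token):
    """Hypothetical heuristic for spotting OCR garbage in a single token."""
    core = token.strip(EDGE_PUNCTUATION)
    if not core or core.isdigit():
        return False  # pure punctuation and plain numbers are kept
    if not core.isalpha():
        return True  # letters mixed with digits or symbols, e.g. "J0#n"
    return not any(ch in VOWELS for ch in core)  # no vowels, e.g. "rcvd"


def mask_noise(text):
    """Replace every noisy whitespace-separated token with the placeholder."""
    return " ".join(
        PLACEHOLDER if looks_like_noise(token) else token
        for token in text.split()
    )
```

Applied before calling the model, this keeps the number of distinct token strings bounded by whatever the heuristic lets through, which is the effect the placeholder idea is after.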

github-actions · 2021-11-04T00:02:03Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

svlandeg added the "feat / cli" (Feature: Command-line interface) and "training" (Training and updating models) labels Jun 21, 2020

adrianeboyd added the "bug" (Bugs and behaviour differing from documentation) label Jun 22, 2020

adrianeboyd mentioned this issue Jun 22, 2020 in "Skip vocab in component config overrides" #5624 (merged)

svlandeg closed this as completed in #5624 Jun 23, 2020

viaregio mentioned this issue Jul 1, 2020 in "2.3 less accurate than 2.2.x" #5662 (closed)

github-actions bot locked as resolved and limited conversation to collaborators Nov 4, 2021
