Updating to Spacy-Nightly causing training issues #3142

Closed
carlsonhang opened this Issue Jan 10, 2019 · 4 comments
carlsonhang commented Jan 10, 2019

How to reproduce the behaviour

Hello, I didn't have any issues training the DependencyParser with my code on v2.0, but after updating to v2.1.0a5 the model doesn't seem to train at all. The reason for updating was to try training on the fix for this issue without redefining the tensor, since I have factories that merge tokens before labeling dependencies, and I was hoping this would increase accuracy. I have also downloaded and am using the matching en_core_web_md-2.1.0a5 model, to see whether having vectors will also improve labeling accuracy.

The code for adding labels/training data and the training loop looks very similar to this example, and I quickly generate the gold objects for training with this:

from spacy.gold import GoldParse

gold_dep = []
for d, dmap in TRAIN_DEP:
    doc = nlp(d)
    gold_dep.append((doc, GoldParse(doc, deps=dmap['deps'], heads=dmap['heads'])))

My factory's merge call looks something like this, with the old workaround commented out:

def __call__(self, doc):
    # doc.tensor = numpy.zeros((0,), dtype='float32')  # old workaround
    matches = self.matcher(doc)
    spans = []
    for match_id, start, end in matches:
        try:
            entity = Span(doc, start, end, label=match_id)
            spans.append(entity)
            for token in entity:
                token._.set('is_pathology', True)
            doc.ents = list(doc.ents) + [entity]
        except Exception:
            # skip spans that can't be added (e.g. overlapping entities)
            continue
    for span in spans:
        span.merge('NN', span[-1].lemma_, span.label_)
    return doc

Output:

Training parser...
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}
Losses {'parser': 0.0}

Quick test:

Sentence: no evidence of hemorrhage
[('no', 'subtok', 1, 'evidence'), ('evidence', 'subtok', 2,  'of'), ('of', 'subtok', 3, 'hemorrhage'), ('hemorrhage', 'ROOT', 3, 'hemorrhage')]

With multiple training examples, it should look something like this:

[('no', 'negate', 3, 'hemorrhage'), ('evidence', 'nmod', 2, 'of'), ('of', 'prep', 3, 'hemorrhage'), ('hemorrhage', 'ROOT', 3, 'hemorrhage')]

The model simply relabels all of my labels as 'subtok' and the heads aren't trained. Am I supposed to be merging differently with the new update, maybe with the retokenizer? Like I said, I was able to train just fine on v2.0, but after updating to spacy-nightly and the base model, the training doesn't seem to do anything. And thanks for everything you guys do.
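
If the retokenizer is the intended replacement for Span.merge, I'm guessing my merge step would become something like this (untested sketch on my part, using the same attributes I'm currently setting):

# Untested sketch: the same merge step via doc.retokenize() instead of Span.merge().
# 'spans' is the list built by the matcher loop in my factory above.
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(
            span,
            attrs={'TAG': 'NN', 'LEMMA': span[-1].lemma_, 'ENT_TYPE': span.label_},
        )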

Your Environment

  • Operating System: Windows-10-10.0.16299-SP0
  • Python Version Used: 3.7.0
  • spaCy Version Used: 2.1.0a5
  • Environment Information:
honnibal (Member) commented Jan 21, 2019

Could you post the full training code? My guess is that you might be calling nlp.begin_training() which is resetting the weights. I'm not sure why the same thing wasn't happening before.

You could try the new nlp.resume_training() method, which might avoid the problem?
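
Roughly, the change would be something like this (just a sketch; n_iter and train_data stand in for your own loop variables):

import random

# resume_training() returns an optimizer but keeps the existing weights,
# whereas begin_training() re-initializes the models.
optimizer = nlp.resume_training()
for i in range(n_iter):
    random.shuffle(train_data)
    losses = {}
    for doc, gold in train_data:
        nlp.update([doc], [gold], sgd=optimizer, losses=losses)
    print('Losses', losses)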

carlsonhang (Author) commented Jan 21, 2019

Ah yes, I found my mistake. I added a blank parser before making my gold objects, which required me to call optimizer = nlp.begin_training() so that nlp(d) would run with my custom factories and produce the doc objects; otherwise it would throw TypeError: 'bool' object is not iterable. My old code then passed that optimizer to the training loop.

After moving that part of the code to after the gold object creation, and moving nlp.begin_training() back into the training loop, the training works again (although my model predictions became much worse for some reason). Thanks for the help!
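
For reference, the ordering that works for me now is roughly this (simplified sketch of my setup, with the training loop itself omitted):

# 1) Build the gold objects first, with the pipeline (and custom factories) as loaded.
gold_dep = []
for d, dmap in TRAIN_DEP:
    doc = nlp(d)
    gold_dep.append((doc, GoldParse(doc, deps=dmap['deps'], heads=dmap['heads'])))

# 2) Only then set up the blank parser and its labels, and call begin_training()
#    as part of the training setup rather than before step 1.
#    (Assumes the loaded pipeline doesn't already contain a 'parser' component.)
parser = nlp.create_pipe('parser')
nlp.add_pipe(parser)
for _, dmap in TRAIN_DEP:
    for dep in dmap['deps']:
        parser.add_label(dep)

optimizer = nlp.begin_training()
# ... training loop passes sgd=optimizer to nlp.update() ...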

carlsonhang (Author) commented Jan 21, 2019

I've run into another issue lately with training, which I hope it's okay to post here. I have been playing around with generating my own word vectors using Gensim word2vec. This is how I set the vectors:

import spacy

vec_data, vec_keys, vec_shape = read_vectors('model.txt')
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
nlp.vocab.reset_vectors(shape=vec_shape)
for key, vector in zip(vec_keys, vec_data):
    nlp.vocab.set_vector(key, vector)
nlp.vocab.vectors.name = 'new_vectors'
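
For context, read_vectors() is my own helper; a simplified sketch of it would be (the underscore-to-space step is just how I restore my multi-word keys):

import numpy

def read_vectors(path):
    # Parses the word2vec text format: the first line is "<n_keys> <n_dims>",
    # each following line is "<key> <v1> <v2> ...".
    with open(path, encoding='utf8') as f:
        n_keys, n_dims = map(int, f.readline().split())
        keys, data = [], []
        for line in f:
            pieces = line.rstrip().rsplit(' ', n_dims)
            keys.append(pieces[0].replace('_', ' '))
            data.append(numpy.asarray(pieces[1:], dtype='float32'))
    return data, keys, (n_keys, n_dims)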

read_vectors() simply reads the vectors in from the file; I wrote my own loader because I'd like to train on bigrams/trigrams that contain spaces. After training and saving the model successfully, I get this error when I load the model back for testing:

Traceback (most recent call last):
  File "C:\Users\carlson.hang\Desktop\Code\DepTraining\Trainer\tools\test.py", line 82, in <module>
    main(model=model_dir) #trained model
  File "C:\Users\carlson.hang\Desktop\Code\DepTraining\Trainer\tools\test.py", line 37, in main
    nlp = spacy.load(model)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\util.py", line 121, in load_model
    return load_model_from_path(name, **overrides)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\util.py", line 160, in load_model_from_path
    return nlp.from_disk(model_path)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\language.py", line 767, in from_disk
    util.from_disk(path, deserializers, exclude)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\util.py", line 580, in from_disk
    reader(path / key)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\language.py", line 763, in <lambda>
    deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
  File "pipeline.pyx", line 844, in spacy.pipeline.Tagger.from_disk
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\util.py", line 580, in from_disk
    reader(path / key)
  File "pipeline.pyx", line 827, in spacy.pipeline.Tagger.from_disk.load_model
  File "pipeline.pyx", line 739, in spacy.pipeline.Tagger.Model
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\_ml.py", line 483, in build_tagger_model
    pretrained_vectors=pretrained_vectors,
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\_ml.py", line 296, in Tok2Vec
    glove = StaticVectors(pretrained_vectors, width, column=cols.index(ID))
  File "C:\anaconda3\envs\nightly\lib\site-packages\thinc\neural\_classes\static_vectors.py", line 43, in __init__
    vectors = self.get_vectors()
  File "C:\anaconda3\envs\nightly\lib\site-packages\thinc\neural\_classes\static_vectors.py", line 55, in get_vectors
    return get_vectors(self.ops, self.lang)
  File "C:\anaconda3\envs\nightly\lib\site-packages\thinc\extra\load_nlp.py", line 22, in get_vectors
    nlp = get_spacy(lang)
  File "C:\anaconda3\envs\nightly\lib\site-packages\thinc\extra\load_nlp.py", line 14, in get_spacy
    SPACY_MODELS[lang] = spacy.load(lang, **kwargs)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\__init__.py", line 22, in load
    return util.load_model(name, **overrides)
  File "C:\anaconda3\envs\nightly\lib\site-packages\spacy\util.py", line 122, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_model.vectors'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Training the model with en_core_web_sm doesn't produce this error. Is there a way to work around it in case I need to append to the pretrained vectors of en_core_web_md?

I was also wondering which approach would be better: keeping the existing en_core_web_md vectors and only setting the vectors I do have, or resetting all the vectors to my domain-specific vocabulary?

And finally, is there a good way to preprocess data for training word vectors for spaCy, or a way to change the data representation for training and prediction? My training data is currently limited and contains a lot of measurements like 2 x 3 cm or 2 x 3 x 4 cm, so it seems like it would be ideal to replace all numbers with #, i.e. # x # cm, both for training the vectors and for training a parser model.

My issue with this is that if I train with vectors like that and then process a document with the same '#' preprocessing, I lose the original text/measurement. And if I process the document twice, with and without the preprocessing, the two doc objects might produce different tokenizations and parse trees, so I can't map back to the original text that way either.

Sorry for having so many questions. I've been trying for a while to make the dependency parser better for my domain without resorting to manual parsing, but my model's predictions have been poor even on some of the training data, of which I have hundreds of examples, many with structure similar to what I'm trying to extract. Any advice on the above questions would help me out tremendously.

honnibal (Member) commented Feb 21, 2019

is there a way to work around this in case I need to append to the pretrained vectors for en_core_web_md?

There's no way to do that, no. If you have an md or lg model, you need to use the same vectors at training and runtime --- otherwise the accuracy will be really bad, as the input will be very different. So you can only load your own vectors if you're using the sm model, or if you retrain entirely from scratch.

And finally, is there a good way to preprocess data for training word vectors for spacy or a way to change the data representation for training and predictions?

spaCy uses some subword features in its dependency parser, so I wouldn't worry too much about this. Basically it has features for the word shape, prefix, suffix etc., so the model does generalise a bit beyond the exact vectors. I wouldn't try to pre-process the text in a way that loses information.
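
For example, the shape feature already normalises digits, so measurements like yours end up looking alike to the parser:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('2 x 3 cm')
print([(token.text, token.shape_) for token in doc])
# [('2', 'd'), ('x', 'x'), ('3', 'd'), ('cm', 'xx')]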

honnibal closed this Feb 21, 2019
