Updating to Spacy-Nightly causing training issues #3142
How to reproduce the behaviour
Hello, I didn't have issues training the DependencyParser with my code on v2.0, but after updating to v2.1.0a5 the model doesn't seem to train at all. The reason for updating was to try training on the fix for this issue without redefining the tensor, since I have factories merging tokens before labeling dependencies. I was hoping this would increase accuracy. I have also downloaded and am using the matching en_core_web_md-2.1.0a5 model to see whether having vectors will also improve labeling accuracy.
The code for adding labels/training data and the training loop looks very similar to this example, and I quickly generate gold objects for training with this:
My factory's merge call looks something like this, with the old workaround commented out:
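For reference, in v2.1 merging can be done with the new retokenize context manager instead of the old span.merge() workaround; a minimal sketch with the factory/pipeline wiring omitted (the example sentence is made up):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("send alert to New York office")

# Merge the "New York" tokens into a single token; the retokenizer
# handles the head/dependency bookkeeping that the old workaround
# needed to patch up manually.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[3:5])
```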
The training data should look something like this, with multiple examples:
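The examples themselves were lost from the thread; the usual shape of such data, following the spaCy v2 training examples, is a list of (text, annotations) pairs. These particular sentences and labels are invented:

```python
# Invented examples in the (text, annotations) format used by the
# spaCy v2 training examples; heads are token indices, deps are labels.
TRAIN_DATA = [
    (
        "find a cafe with great wifi",
        {
            "heads": [0, 2, 0, 2, 5, 3],
            "deps": ["ROOT", "det", "dobj", "prep", "amod", "pobj"],
        },
    ),
    (
        "find a hotel near the beach",
        {
            "heads": [0, 2, 0, 2, 5, 3],
            "deps": ["ROOT", "det", "dobj", "prep", "det", "pobj"],
        },
    ),
]
```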
The model simply relabels all my labels to 'subtok' and the heads aren't trained. Am I supposed to be merging differently with the new update? Something to do with retokenize? Like I said, I was able to train just fine on v2.0, but after updating to spacy-nightly and updating the base model, training doesn't seem to do anything. And thanks for everything you guys do.
Could you post the full training code? My guess is that you might be calling nlp.begin_training() somewhere that resets the model weights.
You could try the new retokenizer for merging.
Ah yes, I found my mistake. I added a blank parser before making my gold objects, which required me to call nlp.begin_training() before creating them.
After moving that part of the code to after the gold object creation, and moving nlp.begin_training() back to just before the training loop, training works again (though my model's predictions became much, much worse for some reason). Thanks for the help!
I've run into another issue with training lately, which I hope it's okay to post here. I have been playing around with generating my own word vectors using gensim's word2vec. This is how I set the vectors:
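The snippet above didn't survive; a sketch of one way to load externally trained vectors into a spaCy vocab via Vocab.set_vector (the random vectors here are stand-ins for real word2vec output, and the words are made up):

```python
import numpy
import spacy

nlp = spacy.blank("en")

# Stand-in for gensim word2vec output: a {word: 50-dim vector} mapping.
# With a real model this would come from model.wv instead.
fake_w2v = {
    "pressure": numpy.random.uniform(-1, 1, (50,)).astype("float32"),
    "valve": numpy.random.uniform(-1, 1, (50,)).astype("float32"),
}

# Register each vector with the vocab so tokens pick it up.
for word, vector in fake_w2v.items():
    nlp.vocab.set_vector(word, vector)
```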
Training the model with
However, I was also contemplating whether the better approach would be to keep the existing vectors from the pretrained model.
And finally, is there a good way to preprocess data for training word vectors for spaCy, or a way to change the data representation for training and predictions? Currently my training data is limited, and I have a lot of data like this:
But my issue with this is that if I train with vectors like that and then process a document with those vectors/preprocessing, I lose the original text/measurement. And if I process the document twice, with and without the '#' preprocessing, the resulting Doc objects might produce different parse trees/tokenizations, so I won't be able to recover the original text that way either.
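If the preprocessing in question is the common trick of collapsing digits into '#' (an assumption on my part, since the sample data didn't survive the thread), the lossy step can be made reversible by storing the original numbers alongside the masked text:

```python
import re

def mask_digits(text):
    """Replace each digit with '#', returning the masked text plus the
    original number strings so the measurement can be recovered later.
    (Hypothetical helper, not from the thread.)"""
    originals = re.findall(r"\d+(?:\.\d+)?", text)
    masked = re.sub(r"\d", "#", text)
    return masked, originals

masked, originals = mask_digits("set pressure to 12.5 psi")
```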
And sorry for having so many questions. I've been attempting for a while to make the dependency parser better for my domain without having to resort to manual parsing, but my model's predictions have been poor, even on some of the training data, of which I have hundreds of examples, many with structure similar to what I'm trying to extract. If you have any advice on the above questions, it would help me out tremendously.
There's no way to do that, no. If you have an
spaCy uses some subword features in its dependency parser training, so I wouldn't worry too much about this. Basically it has features for the word shape, prefix, suffix, etc., so the vectors do generalise a bit. I wouldn't preprocess the text in a way that loses information.
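The shape/prefix/suffix features mentioned are visible as lexical attributes on any token, which is a quick way to see how unseen numbers still look alike to the model even without vectors (the example text is mine):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("12.5 980 mg")

# Digits map to 'd' and letters to 'x'/'X' in the shape feature, so
# different numbers share similar shapes regardless of vector coverage.
for token in doc:
    print(token.text, token.shape_, token.prefix_, token.suffix_)
```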