Low accuracy of POS tagger trained on Universal Dependency French corpus #827
It seems that spaCy's POS tagger training algorithm is broken.
Hi,
Apologies that the process for this is still quite unpolished, and the docs aren't fully helpful. I haven't sat down with your script yet, but it looks to me like you're not adding the words to the [...]
We can and will do better than this API, but for now I think that might be what's wrong with your code.
Matt
Hi Matt, Thank you for your answer: your advice helped me a lot.
It learns very well, but the finalization seems to break the model:
So I redefined my own
Now the evaluation shows good results, comparable to LogisticRegression from scikit-learn:
It's a good start and it's interesting that the script also provides a dependency parser for French. Thanks again for your help. Thomas
Can you share the end_training() function and/or explain what changes you made?
Here is the code where the
And this is how I simply redefined my own
Thank you Thomas, I'll play around with this over this week and next. I hope I can help with something.
@thomasgirault Hi, I've noticed you're using spaCy 1.5/1.6 to train the tagger.
spaCy 1.7.0 (including @raphael0202's work on French tokenization) is now up!
Sorry to come back to this, but I'm running into the same issue (using the PTB as training data). Even with a handful of sentences, the sklearn classifier thomasgirault used above gives me decent values, while spaCy never exceeds 39%. EDIT: my code @ https://gist.github.com/sjmielke/95ed4aeb7f5dee2dee1a7bf2092cb228
I've got the same problem @sjmielke does. Even when I try to train on the English UD set, the accuracy hovers around 33%. There's got to be something we're both doing wrong. Adding _ = vocab[word] didn't change anything. My code: https://gist.github.com/kamac/a7bc139f62488839a8118214a4d932f2 I'm using version 1.8.2
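For anyone trying to judge numbers like the 33% and 39% reported above, a trivial baseline can be a useful reference point. The sketch below is neither the spaCy tagger nor the LogisticRegression model discussed in this thread; it is a plain most-frequent-tag baseline written with only the standard library. The function names `train_baseline`, `tag_words` and `accuracy` are hypothetical, and it assumes each sentence is a list of (word, tag) pairs.

```python
from collections import Counter, defaultdict


def train_baseline(sentences):
    """Learn the most frequent tag for each lowercased word.

    `sentences` is assumed to be an iterable of [(word, tag), ...] lists.
    Returns (lexicon, default_tag), where default_tag is the most common
    tag overall and is used for words never seen in training.
    """
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            per_word[word.lower()][tag] += 1
            overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lexicon, default_tag


def tag_words(words, lexicon, default_tag):
    """Tag each word with its most frequent training tag, or the default."""
    return [lexicon.get(w.lower(), default_tag) for w in words]


def accuracy(gold_sentences, lexicon, default_tag):
    """Token-level accuracy of the baseline against gold (word, tag) pairs."""
    correct = 0
    total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_words(words, lexicon, default_tag)
        for (_, gold), guess in zip(sentence, predicted):
            correct += int(gold == guess)
            total += 1
    return correct / total if total else 0.0
```

If a trained tagger scores well below a baseline like this on the same held-out data, that points to a problem in the training or evaluation setup rather than in the data itself.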
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
Your Environment
Hi,
I am trying to train a POS tagger for French.
I started by getting the Universal Dependencies corpus for French:
Then I followed the spaCy POS tagger tutorial,
However, the results are quite disappointing because the accuracy obtained is low:
I saw issue #773 about NER post training and I suppose it may be related.
Is there a step I missed?
Thanks in advance for your help,
Thomas
Here is the process I followed:
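Universal Dependencies treebanks are distributed in CoNLL-U format, so the first step of any such pipeline is turning those files into (word, tag) pairs. The following is a minimal sketch of that step, not the script from this issue: the function name `read_conllu` and the `SAMPLE` data are made up for illustration, and it only extracts the FORM and UPOS columns. It skips comment lines, multiword-token ranges (IDs such as `3-4`) and empty nodes (IDs such as `5.1`).

```python
from typing import Iterable, Iterator, List, Tuple

Sentence = List[Tuple[str, str]]


def read_conllu(lines: Iterable[str]) -> Iterator[Sentence]:
    """Yield sentences as lists of (form, upos) pairs from CoNLL-U lines."""
    sentence: Sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            # Sentence-level comments such as "# text = ..." carry no tokens.
            continue
        columns = line.split("\t")
        token_id, form, upos = columns[0], columns[1], columns[3]
        if "-" in token_id or "." in token_id:
            # Multiword-token range or empty node: no tag of its own to learn.
            continue
        sentence.append((form, upos))
    if sentence:
        # Handle a final sentence that is not followed by a blank line.
        yield sentence


# Hypothetical two-sentence sample, written by hand for illustration only.
SAMPLE = "\n".join([
    "# text = Le chat dort.",
    "1\tLe\tle\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tchat\tchat\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tdort\tdormir\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
    "# text = Il parle du chat.",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_",
    "2\tparle\tparler\tVERB\t_\t_\t0\troot\t_\t_",
    "3-4\tdu\t_\t_\t_\t_\t_\t_\t_\t_",
    "3\tde\tde\tADP\t_\t_\t5\tcase\t_\t_",
    "4\tle\tle\tDET\t_\t_\t5\tdet\t_\t_",
    "5\tchat\tchat\tNOUN\t_\t_\t2\tobl\t_\t_",
    "6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_",
])
```

Calling `list(read_conllu(SAMPLE.splitlines()))` gives one list of pairs per sentence; for a real treebank file, an open file handle can be passed instead. Note that French contractions such as "du" appear as a range line plus the two syntactic words "de" and "le", which is why the range line is skipped.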