
Low accuracy of POS tagger trained on Universal Dependency French corpus #827

Closed
thomasgirault opened this issue Feb 14, 2017 · 11 comments
Labels: docs (Documentation and website), usage (General spaCy usage)

@thomasgirault

Your Environment

  • Operating System: Ubuntu 16.04
  • Python Version Used: Python 3.5 (Miniconda)
  • spaCy Version Used: 1.6.0 & 1.5.0

Hi,

I am trying to train a POS tagger for French.
I started by getting the Universal Dependencies corpus for French.
Then, I followed the spaCy POS tagger tutorial.
However, the results are quite disappointing because the accuracy obtained is low:

             precision    recall  f1-score   support
        ADJ       0.28      0.22      0.24       517
        ADP       0.30      0.43      0.35       713
        ADV       0.10      0.36      0.16        98
        AUX       0.09      0.15      0.11       100
       CONJ       0.03      0.07      0.05        87
        DET       0.66      0.39      0.49      1719
       INTJ       0.00      0.00      0.00         0
       NOUN       0.62      0.74      0.67      1054
        NUM       0.12      0.21      0.15        82
       PART       0.06      0.08      0.07        51
       PRON       0.17      0.21      0.19       354
      PROPN       0.39      0.16      0.23       728
      PUNCT       0.35      0.42      0.38       713
      SCONJ       0.19      0.27      0.22        79
        SYM       0.13      0.27      0.18        11
       VERB       0.37      0.37      0.37       713
          X       0.00      0.00      0.00         1
avg / total       0.44      0.39      0.39      7020

I saw issue #773 about NER post-training and I suppose it may be related.
Is there any step I missed?
Thanks in advance for your help,

Thomas

Here is the process I followed:

$ head  ./UD_French/fr-ud-train.conllu

    # sentid: fr-ud-train_00001
    # sentence-text: Les commotions cérébrales sont devenu si courantes dans ce sport qu'on les considére presque comme la routine.
    1	Les	le	DET	_	Definite=Def|Gender=Fem|Number=Plur|PronType=Art	2	det	_	_
    2	commotions	commotion	NOUN	_	Gender=Fem|Number=Plur	5	nsubj	_	_
    3	cérébrales	cérébral	ADJ	_	Gender=Fem|Number=Plur	2	amod	_	_
    4	sont	être	AUX	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	5	aux	_	_
    5	devenu	devenir	VERB	_	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_	_
    6	si	si	ADV	_	_	7	advmod	_	_
    7	courantes	courant	ADJ	_	Gender=Fem|Number=Plur	5	xcomp	_	_
    8	dans	dans	ADP	_	_	10	case	_	_


    import random
    from collections import Counter

    from pathlib import Path
    from spacy.vocab import Vocab
    from spacy.tagger import Tagger
    from spacy.tokens import Doc
    from spacy.gold import GoldParse

    from sklearn.metrics import accuracy_score, classification_report


    TAG_MAP = {
        'X':{'pos':"X"},
        'NOUN':{'pos':"NOUN"},
        'DET':{'pos':"DET"},
        'ADV':{'pos':"ADV"},
        'AUX':{'pos':"AUX"},
        'PRON':{'pos':"PRON"},
        'SCONJ':{'pos':"SCONJ"},
        'CONJ':{'pos':"CONJ"},
        'VERB':{'pos':"VERB"},
        'PROPN':{'pos':"PROPN"},
        'PUNCT':{'pos':"PUNCT"},
        'INTJ':{'pos':"INTJ"},
        'ADJ':{'pos':"ADJ"},
        'ADP':{'pos':"ADP"},
        'NUM':{'pos':"NUM"},
        'SYM':{'pos':"SYM"},
        'PART':{'pos':"PART"}
    }


    def gen_corpus(path):
        doc = []
        tagset = set()
        with open(path) as file:
            for line in file:
                if line[0].isdigit():
                    features = line.split()
                    word, pos = features[1], features[3]
                    if pos != "_":
                        tagset.add(pos)
                        doc.append((word, pos)) 
                elif len(line.strip()) == 0:
                    if len(doc) > 0:
                        words, tags = zip(*doc)
                        yield (list(words), list(tags))
                    doc = []


    def evaluation(TEST_DATA):
        counter = Counter()
        y_pred, y_true = [], []
        for words, tags in TEST_DATA:
            doc = Doc(vocab, words=words)
            tagger(doc)
            for i, word in enumerate(doc):
                counter[word.pos_ == tags[i]] += 1
                y_pred.append(word.pos_)
                y_true.append(tags[i])
        print(counter)
        return y_pred, y_true


    def ensure_dir(path):
        if not path.exists():
            path.mkdir()


    def gen_tagger(TRAIN_DATA, output_dir=None):
        if output_dir is not None:
            output_dir = Path(output_dir)
            ensure_dir(output_dir)
            ensure_dir(output_dir / "pos")
            ensure_dir(output_dir / "vocab")

        vocab = Vocab(tag_map=TAG_MAP)
        tagger = Tagger(vocab)
        for i in range(50):
            print(i)
            for words, tags in TRAIN_DATA:
                doc = Doc(vocab, words=words)
                gold = GoldParse(doc, tags=tags)
                tagger.update(doc, gold)
            random.shuffle(TRAIN_DATA)
        # tagger.model.end_training()
        
        if output_dir is not None:
            tagger.model.dump(str(output_dir / 'pos' / 'model'))
            with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
                tagger.vocab.strings.dump(file_)

        return tagger, vocab

    if __name__ == '__main__':

        train_path = "./UD_French/fr-ud-train.conllu"
        test_path = "./UD_French/fr-ud-test.conllu"

        TRAIN_DATA = list(gen_corpus(train_path))
        TEST_DATA = list(gen_corpus(test_path))
        tagger, vocab = gen_tagger(TRAIN_DATA, "./spacy_postagger")
        y_pred, y_true = evaluation(TEST_DATA)
        print(classification_report(y_true, y_pred))
@thomasgirault changed the title from "Low accuracy of POS tagging trained on Universal Dependency French corpus" to "Low accuracy of POS tagger trained on Universal Dependency French corpus" Feb 15, 2017
@thomasgirault
Author

It seems that spaCy's POS tagger training algorithm is broken.
This does not seem to be a data issue: a quick try with a tagger trained using scikit-learn's LogisticRegression on the same corpus already shows interesting results.

             precision    recall  f1-score   support

        ADJ       0.88      0.90      0.89       393
        ADP       1.00      0.98      0.99      1032
        ADV       0.96      0.95      0.95       339
        AUX       0.93      0.82      0.87       187
       CONJ       0.99      1.00      1.00       174
        DET       0.99      0.98      0.98      1034
       INTJ       0.57      0.57      0.57         7
       NOUN       0.93      0.95      0.94      1228
        NUM       0.88      0.95      0.91       136
       PART       0.97      1.00      0.99        68
       PRON       0.93      0.98      0.95       419
      PROPN       0.91      0.79      0.85       345
      PUNCT       1.00      1.00      1.00       840
      SCONJ       0.90      0.93      0.92       106
        SYM       0.87      0.91      0.89        22
       VERB       0.91      0.94      0.93       684
          X       0.17      0.17      0.17         6

avg / total       0.95      0.95      0.95      7020

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '<s>' if index == 0 else sentence[index - 1],
        'next_word': '</s>' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }


def untag(tagged_sentence):
    return [w for w, t in tagged_sentence]
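# For illustration (hypothetical input), the features() extractor above gives
# features("Le chat dort".split(), 1) ->
# {'word': 'chat', 'is_first': False, 'prev_word': 'Le', 'next_word': 'dort',
#  'suffix-2': 'at', ...}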
 
train_path = "./UD_French/fr-ud-train.conllu"
test_path = "./UD_French/fr-ud-test.conllu"


def gen_corpus(path):
    doc = []
    with open(path) as file:
        for line in file:
            if line[0].isdigit():
                features = line.split()
                word, pos = features[1], features[3]
                if pos != "_":
                    doc.append((word, pos)) 
            elif len(line.strip()) == 0:
                if len(doc) > 0:
                    words, tags = zip(*doc)
                    yield (list(words), list(tags))
                doc = []


def evaluation(TEST_DATA):
    y_pred, y_true = [], []
    for words, tags in TEST_DATA:
        for i, (word, pos) in enumerate(pos_tag(words)):
            y_pred.append(pos)
            y_true.append(tags[i])
    return y_pred, y_true
                
                
def transform_to_dataset(tagged_sentences):
    X, y = [], []
 
    for words, tags in tagged_sentences:
        for index, word  in enumerate(words):
            X.append(features(words, index))
            y.append(tags[index])
    return X, y
 
        
def pos_tag(sentence):
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return zip(sentence, tags)

training_sentences = list(gen_corpus(train_path))
test_sentences = list(gen_corpus(test_path))
 
print(len(training_sentences))  # 14554
print(len(test_sentences))      # 298

X, y = transform_to_dataset(training_sentences)
print(len(X))  # number of training tokens: 356419

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=True)),
    ('classifier',  LogisticRegression(n_jobs=4, max_iter=200, verbose=True))
])
clf.fit(X, y)

X_test, y_test = transform_to_dataset(test_sentences)
print( "Accuracy:", clf.score(X_test, y_test)) # Accuracy: 0.951851851852

# test
sents = ["je voudrais réserver un hotel près de Rennes",
         "comment aller à Paris depuis Rennes ?",
         "où est l' Intermarché le plus proche ?", 
        "jusqu' à quelle heure la piscine Saint-Georges est ouverte aujourd ' hui ?"]
for s in sents:
    for w, pos in pos_tag(s.split()):
        print("%s/%s" % (w, pos), end=' ')
    print()

    je/PRON voudrais/AUX réserver/VERB un/DET hotel/NOUN près/ADV de/ADP Rennes/PROPN

    comment/ADV aller/VERB à/ADP Paris/PROPN depuis/ADP Rennes/PROPN ?/PUNCT 

    où/PRON est/VERB l'/DET Intermarché/PROPN le/DET plus/ADV proche/ADJ ?/PUNCT 

    jusqu'/ADP à/ADP quelle/DET heure/NOUN la/DET piscine/NOUN Saint-Georges/PROPN est/AUX ouverte/VERB aujourd/ADJ '/PUNCT hui/ADV ?/PUNCT 

y_pred, y_true = evaluation(test_sentences)
for l in classification_report(y_true, y_pred).split('\n'):
    print(l)


@honnibal
Member

Hi,

Apologies that the process for this is still quite unpolished, and the docs aren't fully helpful.

I haven't sat down with your script yet, but it looks to me like you're not adding the words to the vocab, and not saving the vocab words. Could you compare with the script in bin/parser/train_ud.py? I think you'll see that there's a step where _ = nlp.vocab[word] is performed, so that the vocab is populated prior to training.

We can and will do better than this API --- but for now I think that might be what's wrong with your code.

Matt
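For reference, a minimal sketch of the vocab-population step Matt describes, applied to the training script from the first post (spaCy 1.x API as used in this thread; untested, and variable names follow the original script):

    # Populate the vocab with every training word *before* training,
    # as bin/parser/train_ud.py does via _ = nlp.vocab[word].
    vocab = Vocab(tag_map=TAG_MAP)
    for words, tags in TRAIN_DATA:
        for word in words:
            _ = vocab[word]  # looking up a lexeme adds the word to the vocab

    tagger = Tagger(vocab)
    for itn in range(50):
        for words, tags in TRAIN_DATA:
            doc = Doc(vocab, words=words)
            gold = GoldParse(doc, tags=tags)
            tagger.update(doc, gold)
        random.shuffle(TRAIN_DATA)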

@ines added the docs (Documentation and website) and usage (General spaCy usage) labels Feb 15, 2017
@thomasgirault
Author

Hi Matt,

Thank you for your answer: your advice helped me a lot.
I simply tried the script in bin/parser/train_ud.py:

$ python train_ud.py ./UD_French/fr-ud-train.conllu ./UD_French/fr-ud-dev.conllu ./spacy_french

It learns very well, but the finalization seems to break the model:

0:      78.420  92.605
1:      80.818  94.582
2:      81.394  93.439
....
14:     86.510  95.714
# end_training()
14:     28.952  0.352   0.101

So I redefined my own end_training() function to avoid model averaging during finalization (the code is in a comment below).
The model then keeps its weights for future reuse:

0:      78.420  92.605
1:      80.265  93.232
2:      82.223  94.984
...
14:     86.466  96.083
# end_training()
14:     86.466  82.892  96.083

Now, the evaluation shows nice results that are comparable to LogisticRegression from scikit-learn:

             precision    recall  f1-score   support

        ADJ       0.85      0.90      0.87       381
        ADP       0.99      0.99      0.99      1015
        ADV       0.95      0.90      0.93       352
        AUX       0.90      0.87      0.89       171
       CONJ       0.99      1.00      0.99       173
        DET       0.99      0.98      0.98      1029
       INTJ       0.57      0.50      0.53         8
       NOUN       0.93      0.95      0.94      1226
        NUM       0.89      0.94      0.91       140
       PART       0.97      1.00      0.99        68
       PRON       0.97      0.95      0.96       450
      PROPN       0.90      0.82      0.86       328
      PUNCT       1.00      1.00      1.00       839
      SCONJ       0.84      0.94      0.88        98
        SYM       0.91      0.84      0.87        25
       VERB       0.93      0.92      0.92       708
          X       0.17      0.11      0.13         9

avg / total       0.95      0.95      0.95      7020

It's a good start, and it's interesting that the script also provides a dependency parser for French.
I will evaluate this in a couple of weeks (after the holidays), and I will also give lemmatisation and word vectors a try: that could help contribute French language support to spaCy.

Thanks again for your help.

Thomas

@davidvartanian

Can you share the end_training() function and/or explain what changes you made?
I need to do the same for many other languages, so I could also contribute support for more languages to spaCy.

@thomasgirault
Author

Here is the code where the Language.end_training() method is replaced in bin/parser/train_ud.py (lines [112,113]):

def main(train_loc, dev_loc, model_dir, tag_map_loc=None):
    ...
    end_training(model_dir, vocab=vocab, tagger=tagger, parser=parser)
    # nlp = Language(vocab=vocab, tagger=tagger, parser=parser)
    # nlp.end_training(model_dir)

And this is how I redefined my own end_training() function, adapted from the original method, to avoid model averaging during finalization:

# adapted from spacy.language.end_training (lines [357,402])
# assumes the module-level imports used in train_ud.py: json, pathlib,
# and TAG, DEP, HEAD from spacy.attrs
def end_training(path, vocab, tagger, parser):
    if isinstance(path, str):
        path = pathlib.Path(path)

    if tagger:
        # disable the tagger end_training()      
        # tagger.model.end_training()
        tagger.model.dump(str(path / 'pos' / 'model'))
    if parser:
        # disable the parser end_training()      
        # parser.model.end_training()
        parser.model.dump(str(path / 'deps' / 'model'))
    strings_loc = path / 'vocab' / 'strings.json'
    with strings_loc.open('w', encoding='utf8') as file_:
        vocab.strings.dump(file_)
    vocab.dump(path / 'vocab' / 'lexemes.bin')

    if tagger:
        tagger_freqs = list(tagger.freqs[TAG].items())
    else:
        tagger_freqs = []
    if parser:
        dep_freqs = list(parser.moves.freqs[DEP].items())
        head_freqs = list(parser.moves.freqs[HEAD].items())
    else:
        dep_freqs = []
        head_freqs = []

    # disable NER stuff
    with (path / 'vocab' / 'serializer.json').open('w') as file_:
        file_.write(
            json.dumps([
                (TAG, tagger_freqs),
                (DEP, dep_freqs),
                # NER TAGS DISABLED HERE
                (HEAD, head_freqs)
            ]))

@davidvartanian

Thank you Thomas, I'll play around with it this week and next. I hope I can help with something.

@ines added this to the Improve training API milestone Feb 18, 2017
@thomasgirault reopened this Feb 23, 2017
@raphael0202
Contributor

raphael0202 commented Mar 18, 2017

@thomasgirault Hi, I've noticed you're using spaCy 1.5/1.6 to train the tagger.
I've worked on spaCy's tokenization for French (it is available on master and should be in the next release), and it should (hopefully) improve the tagger's accuracy.

@ines
Member

ines commented Mar 18, 2017

spaCy 1.7.0 (including @raphael0202's work on French tokenization) is now up!

@ines closed this as completed Mar 18, 2017
@sjmielke

sjmielke commented May 3, 2017

Sorry to come back to this, but I'm running into the same issue (using the PTB as training data). Even with a handful of sentences, the sklearn classifier thomasgirault used above gives me decent values, while spaCy never exceeds 39%.
Calling _ = vocab[w] for all words in my training set does not seem to improve things, nor does removing tagger.model.end_training() (in fact, then I can't even reach 30%).
Is there anything else I'm missing, or is this broken?

EDIT: my code @ https://gist.github.com/sjmielke/95ed4aeb7f5dee2dee1a7bf2092cb228

@kamac

kamac commented Jul 5, 2017

I've got the same problem @sjmielke does. Even when I train on the English UD set, the accuracy floats around 33%. There's got to be something we're both doing wrong.

Adding _ = vocab[word] didn't change anything.

My code: https://gist.github.com/kamac/a7bc139f62488839a8118214a4d932f2

I'm using version 1.8.2

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot locked this as resolved and limited conversation to collaborators May 8, 2018