
Low accuracy of POS tagger trained on Universal Dependency French corpus #827

Closed
thomasgirault opened this issue Feb 14, 2017 · 11 comments
Labels: docs (Documentation and website), usage (General spaCy usage)

@thomasgirault

Your Environment

  • Operating System: Ubuntu 16.04
  • Python Version Used: Python 3.5 (Miniconda)
  • spaCy Version Used: 1.6.0 & 1.5.0

Hi,

I am trying to train a POS tagger for French.
I started by getting the Universal Dependencies corpus for French.
Then, I followed the spaCy POS tagger tutorial.
However, the results are quite disappointing because the accuracy obtained is low:

             precision    recall  f1-score   support
        ADJ       0.28      0.22      0.24       517
        ADP       0.30      0.43      0.35       713
        ADV       0.10      0.36      0.16        98
        AUX       0.09      0.15      0.11       100
       CONJ       0.03      0.07      0.05        87
        DET       0.66      0.39      0.49      1719
       INTJ       0.00      0.00      0.00         0
       NOUN       0.62      0.74      0.67      1054
        NUM       0.12      0.21      0.15        82
       PART       0.06      0.08      0.07        51
       PRON       0.17      0.21      0.19       354
      PROPN       0.39      0.16      0.23       728
      PUNCT       0.35      0.42      0.38       713
      SCONJ       0.19      0.27      0.22        79
        SYM       0.13      0.27      0.18        11
       VERB       0.37      0.37      0.37       713
          X       0.00      0.00      0.00         1
avg / total       0.44      0.39      0.39      7020

I saw issue #773 about NER post-training and I suppose it may be related.
Is there any step I missed?
Thanks in advance for your help,

Thomas

Here is the process I followed:

$ head  ./UD_French/fr-ud-train.conllu

    # sentid: fr-ud-train_00001
    # sentence-text: Les commotions cérébrales sont devenu si courantes dans ce sport qu'on les considére presque comme la routine.
    1	Les	le	DET	_	Definite=Def|Gender=Fem|Number=Plur|PronType=Art	2	det	_	_
    2	commotions	commotion	NOUN	_	Gender=Fem|Number=Plur	5	nsubj	_	_
    3	cérébrales	cérébral	ADJ	_	Gender=Fem|Number=Plur	2	amod	_	_
    4	sont	être	AUX	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	5	aux	_	_
    5	devenu	devenir	VERB	_	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_	_
    6	si	si	ADV	_	_	7	advmod	_	_
    7	courantes	courant	ADJ	_	Gender=Fem|Number=Plur	5	xcomp	_	_
    8	dans	dans	ADP	_	_	10	case	_	_


    import random
    from collections import Counter

    from pathlib import Path
    from spacy.vocab import Vocab
    from spacy.tagger import Tagger
    from spacy.tokens import Doc
    from spacy.gold import GoldParse

    from sklearn.metrics import accuracy_score, classification_report


    TAG_MAP = {
        'X':{'pos':"X"},
        'NOUN':{'pos':"NOUN"},
        'DET':{'pos':"DET"},
        'ADV':{'pos':"ADV"},
        'AUX':{'pos':"AUX"},
        'PRON':{'pos':"PRON"},
        'SCONJ':{'pos':"SCONJ"},
        'CONJ':{'pos':"CONJ"},
        'VERB':{'pos':"VERB"},
        'PROPN':{'pos':"PROPN"},
        'PUNCT':{'pos':"PUNCT"},
        'INTJ':{'pos':"INTJ"},
        'ADJ':{'pos':"ADJ"},
        'ADP':{'pos':"ADP"},
        'NUM':{'pos':"NUM"},
        'SYM':{'pos':"SYM"},
        'PART':{'pos':"PART"}
    }


    def gen_corpus(path):
        doc = []
        tagset = set()
        with open(path) as file:
            for line in file:
                if line[0].isdigit():
                    features = line.split()
                    word, pos = features[1], features[3]
                    if pos != "_":
                        tagset.add(pos)
                        doc.append((word, pos)) 
                elif len(line.strip()) == 0:
                    if len(doc) > 0:
                        words, tags = zip(*doc)
                        yield (list(words), list(tags))
                    doc = []


    def evaluation(TEST_DATA):
        counter = Counter()
        y_pred, y_true = [], []
        for words, tags in TEST_DATA:
            doc = Doc(vocab, words=words)
            tagger(doc)
            for i, word in enumerate(doc):
                counter[word.pos_ == tags[i]] += 1
                y_pred.append(word.pos_)
                y_true.append(tags[i])
        print(counter)
        return y_pred, y_true


    def ensure_dir(path):
        if not path.exists():
            path.mkdir()


    def gen_tagger(TRAIN_DATA, output_dir=None):
        if output_dir is not None:
            output_dir = Path(output_dir)
            ensure_dir(output_dir)
            ensure_dir(output_dir / "pos")
            ensure_dir(output_dir / "vocab")

        vocab = Vocab(tag_map=TAG_MAP)
        tagger = Tagger(vocab)
        for i in range(50):
            print(i)
            for words, tags in TRAIN_DATA:
                doc = Doc(vocab, words=words)
                gold = GoldParse(doc, tags=tags)
                tagger.update(doc, gold)
            random.shuffle(TRAIN_DATA)
        # tagger.model.end_training()
        
        if output_dir is not None:
            tagger.model.dump(str(output_dir / 'pos' / 'model'))
            with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
                tagger.vocab.strings.dump(file_)

        return tagger, vocab

    if __name__ == '__main__':

        train_path = "./UD_French/fr-ud-train.conllu"
        test_path = "./UD_French/fr-ud-test.conllu"

        TRAIN_DATA = list(gen_corpus(train_path))
        TEST_DATA = list(gen_corpus(test_path))
        tagger, vocab = gen_tagger(TRAIN_DATA, "./spacy_postagger")
        y_pred, y_true = evaluation(TEST_DATA)
        print(classification_report(y_true, y_pred))
@thomasgirault changed the title from "Low accuracy of POS tagging trained on Universal Dependency French corpus" to "Low accuracy of POS tagger trained on Universal Dependency French corpus" Feb 15, 2017
@thomasgirault
Author

It seems that spaCy's POS tagger training algorithm is broken.
This does not seem to be a data issue: a quick try with a tagger trained using scikit-learn's LogisticRegression on the same corpus already shows interesting results.

             precision    recall  f1-score   support

        ADJ       0.88      0.90      0.89       393
        ADP       1.00      0.98      0.99      1032
        ADV       0.96      0.95      0.95       339
        AUX       0.93      0.82      0.87       187
       CONJ       0.99      1.00      1.00       174
        DET       0.99      0.98      0.98      1034
       INTJ       0.57      0.57      0.57         7
       NOUN       0.93      0.95      0.94      1228
        NUM       0.88      0.95      0.91       136
       PART       0.97      1.00      0.99        68
       PRON       0.93      0.98      0.95       419
      PROPN       0.91      0.79      0.85       345
      PUNCT       1.00      1.00      1.00       840
      SCONJ       0.90      0.93      0.92       106
        SYM       0.87      0.91      0.89        22
       VERB       0.91      0.94      0.93       684
          X       0.17      0.17      0.17         6

avg / total       0.95      0.95      0.95      7020

from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def features(sentence, index):
    """ sentence: [w1, w2, ...], index: the index of the word """
    return {
        'word': sentence[index],
        'is_first': index == 0,
        'is_last': index == len(sentence) - 1,
        'is_capitalized': sentence[index][0].upper() == sentence[index][0],
        'is_all_caps': sentence[index].upper() == sentence[index],
        'is_all_lower': sentence[index].lower() == sentence[index],
        'prefix-1': sentence[index][0],
        'prefix-2': sentence[index][:2],
        'prefix-3': sentence[index][:3],
        'suffix-1': sentence[index][-1],
        'suffix-2': sentence[index][-2:],
        'suffix-3': sentence[index][-3:],
        'prev_word': '<s>' if index == 0 else sentence[index - 1],
        'next_word': '</s>' if index == len(sentence) - 1 else sentence[index + 1],
        'has_hyphen': '-' in sentence[index],
        'is_numeric': sentence[index].isdigit(),
        'capitals_inside': sentence[index][1:].lower() != sentence[index][1:]
    }


def untag(tagged_sentence):
    return [w for w, t in tagged_sentence]
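# For illustration (hypothetical input), the features() extractor above gives
# features("Le chat dort".split(), 1) ->
# {'word': 'chat', 'is_first': False, 'prev_word': 'Le', 'next_word': 'dort',
#  'suffix-2': 'at', ...}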
 
train_path = "./UD_French/fr-ud-train.conllu"
test_path = "./UD_French/fr-ud-test.conllu"


def gen_corpus(path):
    doc = []
    with open(path) as file:
        for line in file:
            if line[0].isdigit():
                features = line.split()
                word, pos = features[1], features[3]
                if pos != "_":
                    doc.append((word, pos)) 
            elif len(line.strip()) == 0:
                if len(doc) > 0:
                    words, tags = zip(*doc)
                    yield (list(words), list(tags))
                doc = []


def evaluation(TEST_DATA):
    y_pred, y_true = [], []
    for words, tags in TEST_DATA:
        for i, (word, pos) in enumerate(pos_tag(words)):
            y_pred.append(pos)
            y_true.append(tags[i])
    return y_pred, y_true
                
                
def transform_to_dataset(tagged_sentences):
    X, y = [], []
 
    for words, tags in tagged_sentences:
        for index, word  in enumerate(words):
            X.append(features(words, index))
            y.append(tags[index])
    return X, y
 
        
def pos_tag(sentence):
    tags = clf.predict([features(sentence, index) for index in range(len(sentence))])
    return zip(sentence, tags)

training_sentences = list(gen_corpus(train_path))
test_sentences = list(gen_corpus(test_path))
 
print(len(training_sentences))  # 14554
print(len(test_sentences))      # 298

X, y = transform_to_dataset(training_sentences)
print(len(X))  # number of training tokens: 356419

clf = Pipeline([
    ('vectorizer', DictVectorizer(sparse=True)),
    ('classifier',  LogisticRegression(n_jobs=4, max_iter=200, verbose=True))
])
clf.fit(X, y)

X_test, y_test = transform_to_dataset(test_sentences)
print( "Accuracy:", clf.score(X_test, y_test)) # Accuracy: 0.951851851852

# test
sents = ["je voudrais réserver un hotel près de Rennes",
         "comment aller à Paris depuis Rennes ?",
         "où est l' Intermarché le plus proche ?", 
        "jusqu' à quelle heure la piscine Saint-Georges est ouverte aujourd ' hui ?"]
for s in sents:
    for w, pos in pos_tag(s.split()):
        print("%s/%s" % (w, pos), end=' ')
    print()

    je/PRON voudrais/AUX réserver/VERB un/DET hotel/NOUN près/ADV de/ADP Rennes/PROPN

    comment/ADV aller/VERB à/ADP Paris/PROPN depuis/ADP Rennes/PROPN ?/PUNCT 

    où/PRON est/VERB l'/DET Intermarché/PROPN le/DET plus/ADV proche/ADJ ?/PUNCT 

    jusqu'/ADP à/ADP quelle/DET heure/NOUN la/DET piscine/NOUN Saint-Georges/PROPN est/AUX ouverte/VERB aujourd/ADJ '/PUNCT hui/ADV ?/PUNCT 

y_pred, y_true = evaluation(test_sentences)
for l in classification_report(y_true, y_pred).split('\n'):
    print(l)


@honnibal
Member

Hi,

Apologies that the process for this is still quite unpolished, and the docs aren't fully helpful.

I haven't sat down with your script yet, but it looks to me like you're not adding the words to the vocab, and not saving the vocab words. Could you compare with the script in bin/parser/train_ud.py? I think you'll see that there's a step where _ = nlp.vocab[word] is performed, so that the vocab is populated prior to training.

We can and will do better than this API --- but for now I think that might be what's wrong with your code.

Matt
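For reference, a minimal sketch of the vocab-population step Matt describes, applied to the training script from the first post (spaCy 1.x API as used in this thread; untested, and variable names follow the original script):

    # Populate the vocab with every training word *before* training,
    # as bin/parser/train_ud.py does via _ = nlp.vocab[word].
    vocab = Vocab(tag_map=TAG_MAP)
    for words, tags in TRAIN_DATA:
        for word in words:
            _ = vocab[word]  # looking up a lexeme adds the word to the vocab

    tagger = Tagger(vocab)
    for itn in range(50):
        for words, tags in TRAIN_DATA:
            doc = Doc(vocab, words=words)
            gold = GoldParse(doc, tags=tags)
            tagger.update(doc, gold)
        random.shuffle(TRAIN_DATA)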

@ines added the docs (Documentation and website) and usage (General spaCy usage) labels Feb 15, 2017
@thomasgirault
Author

Hi Matt,

Thank you for your answer: your advice helped me a lot.
I simply tried the script in bin/parser/train_ud.py:

$ python train_ud.py ./UD_French/fr-ud-train.conllu ./UD_French/fr-ud-dev.conllu ./spacy_french

It learns very well, but the finalization seems to break the model:

0:      78.420  92.605
1:      80.818  94.582
2:      81.394  93.439
....
14:     86.510  95.714
# end_training()
14:     28.952  0.352   0.101

So I redefined my own end_training() function to avoid model averaging during finalization (the code is in a comment below).
The model then keeps its weights for future reuse:

0:      78.420  92.605
1:      80.265  93.232
2:      82.223  94.984
...
14:     86.466  96.083
# end_training()
14:     86.466  82.892  96.083

Now, the evaluation shows nice results that are comparable to LogisticRegression from scikit-learn:

             precision    recall  f1-score   support

        ADJ       0.85      0.90      0.87       381
        ADP       0.99      0.99      0.99      1015
        ADV       0.95      0.90      0.93       352
        AUX       0.90      0.87      0.89       171
       CONJ       0.99      1.00      0.99       173
        DET       0.99      0.98      0.98      1029
       INTJ       0.57      0.50      0.53         8
       NOUN       0.93      0.95      0.94      1226
        NUM       0.89      0.94      0.91       140
       PART       0.97      1.00      0.99        68
       PRON       0.97      0.95      0.96       450
      PROPN       0.90      0.82      0.86       328
      PUNCT       1.00      1.00      1.00       839
      SCONJ       0.84      0.94      0.88        98
        SYM       0.91      0.84      0.87        25
       VERB       0.93      0.92      0.92       708
          X       0.17      0.11      0.13         9

avg / total       0.95      0.95      0.95      7020

It's a good start, and it's interesting that the script also provides a dependency parser for French.
I will evaluate this in a couple of weeks (after the holidays), and I will also give lemmatisation and word vectors a try: that could help contribute French language support to spaCy.

Thanks again for your help.

Thomas

@davidvartanian

Can you share the end_training() function and/or explain what changes you made?
I need to do the same for many other languages, so I could also contribute support for more languages to spaCy.

@thomasgirault
Author

Here is the code where the Language.end_training() method is replaced in bin/parser/train_ud.py (lines [112,113]):

def main(train_loc, dev_loc, model_dir, tag_map_loc=None):
    ...
    end_training(model_dir, vocab=vocab, tagger=tagger, parser=parser)
    # nlp = Language(vocab=vocab, tagger=tagger, parser=parser)
    # nlp.end_training(model_dir)

And this is how I redefined my own end_training() function, adapted from the original method, to avoid model averaging during finalization:

# adapted from spacy.language.end_training (lines [357,402])
# assumes the module-level imports used in train_ud.py: json, pathlib,
# and TAG, DEP, HEAD from spacy.attrs
def end_training(path, vocab, tagger, parser):
    if isinstance(path, str):
        path = pathlib.Path(path)

    if tagger:
        # disable the tagger end_training()      
        # tagger.model.end_training()
        tagger.model.dump(str(path / 'pos' / 'model'))
    if parser:
        # disable the parser end_training()      
        # parser.model.end_training()
        parser.model.dump(str(path / 'deps' / 'model'))
    strings_loc = path / 'vocab' / 'strings.json'
    with strings_loc.open('w', encoding='utf8') as file_:
        vocab.strings.dump(file_)
    vocab.dump(path / 'vocab' / 'lexemes.bin')

    if tagger:
        tagger_freqs = list(tagger.freqs[TAG].items())
    else:
        tagger_freqs = []
    if parser:
        dep_freqs = list(parser.moves.freqs[DEP].items())
        head_freqs = list(parser.moves.freqs[HEAD].items())
    else:
        dep_freqs = []
        head_freqs = []

    # disable NER stuff
    with (path / 'vocab' / 'serializer.json').open('w') as file_:
        file_.write(
            json.dumps([
                (TAG, tagger_freqs),
                (DEP, dep_freqs),
                # NER TAGS DISABLED HERE
                (HEAD, head_freqs)
            ]))

@davidvartanian

Thank you Thomas, I'll play around with it this week and next. I hope I can help with something.

@ines added this to the Improve training API milestone Feb 18, 2017
@thomasgirault reopened this Feb 23, 2017
@raphael0202
Contributor

raphael0202 commented Mar 18, 2017

@thomasgirault Hi, I've noticed you're using spaCy 1.5/1.6 to train the tagger.
I've worked on spaCy's tokenization for French (it is available on master and should be in the next release), and it should (hopefully) improve the tagger's accuracy.

@ines
Member

ines commented Mar 18, 2017

spaCy 1.7.0 (including @raphael0202's work on French tokenization) is now up!

@ines closed this as completed Mar 18, 2017
@sjmielke

sjmielke commented May 3, 2017

Sorry to come back to this, but I'm running into the same issue (using the PTB as training data). Even with a handful of sentences, the sklearn classifier thomasgirault used above gives me decent values, while spaCy never exceeds 39%.
Calling _ = vocab[w] for all words in my training set does not seem to improve things, nor does removing tagger.model.end_training() (in fact, then I can't even reach 30%).
Is there anything else I'm missing, or is this broken?

EDIT: my code @ https://gist.github.com/sjmielke/95ed4aeb7f5dee2dee1a7bf2092cb228

@kamac

kamac commented Jul 5, 2017

I've got the same problem @sjmielke does. Even when I train on the English UD set, the accuracy floats around 33%. There's got to be something we're both doing wrong.

Adding _ = vocab[word] didn't change anything.

My code: https://gist.github.com/kamac/a7bc139f62488839a8118214a4d932f2

I'm using version 1.8.2

@lock

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot locked this as resolved and limited conversation to collaborators May 8, 2018