Score() on Test Dataset / Low F-score Ner training #7442

tristancatteeuw · 2021-03-15T16:29:48Z

Hello,

I am trying to train a custom NER recognizer that would work on Resumes (with +/- 15 new entity tags). However, I am getting a relatively low F-Score (0.63).

Is there any way to use the Scorer ([https://spacy.io/api/scorer#_title]) directly on the test dataset used to train the model, so that I can verify which particular entities work better and which ones don't?

I would like to be able to evaluate what works, because I almost doubled the amount of training data for my model without seeing a significant improvement in its F-score (65 000 words --> 0.60 F-score, 112 000 words --> 0.63).

From some quick testing, I can already see that my 'STUDIES' tag, which is meant to capture the field in which someone did their studies, has problems that I don't understand. It finds the expected values well, but it also wrongly matches random punctuation such as `"`, `?`, `.`, etc.

Otherwise, I'd also welcome any general advice on improving my model. Here is the config file I am using:

`[paths]
train = "C:\Users\Tristan\Documents\CV\trainNew"
dev = "C:\Users\Tristan\Documents\CV\testNew"
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "xx"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-multilingual-uncased"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 500
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]`

Answered by svlandeg

svlandeg · 2021-03-15T16:51:31Z

> Is there any way to use the Scorer ([https://spacy.io/api/scorer#_title]) directly on the test dataset used to train the model, so that I can verify which particular entities work better and which ones don't?

You can run `scorer.score` on a list of `Example` objects, though that will only give you the precision/recall/F-scores. If you just want some example cases to look at, you could take a fixed part of your test set, run each example through `scorer.score([example])`, and print only those where precision and recall are not both 100% (or something similar).
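A minimal sketch of that idea, assuming spaCy v3. Note that it swaps in the static `Scorer.score_spans` method instead of `scorer.score`, so that only entity spans are scored. The blank pipeline, the `make_example` helper, the `SCHOOL` and `SKILL` labels, and the hand-written "predictions" are all illustrative assumptions; with a real model you would build each `Example` from the trained pipeline's output on your test texts.

```python
import spacy
from spacy.scorer import Scorer
from spacy.training import Example

# Assumption: a blank English pipeline stands in for the trained model, and
# the "predictions" below are written by hand so the sketch is self-contained.
nlp = spacy.blank("en")


def make_example(text, predicted, gold):
    """Pair a doc carrying `predicted` entities with a reference doc carrying
    `gold` entities. Both are lists of (start_char, end_char, label)."""
    pred_doc = nlp.make_doc(text)
    pred_doc.ents = [pred_doc.char_span(s, e, label=label) for s, e, label in predicted]
    return Example.from_dict(pred_doc, {"entities": gold})


examples = [
    # One gold entity is missed (false negative).
    make_example("Studied Computer Science at MIT",
                 [(8, 24, "STUDIES")],
                 [(8, 24, "STUDIES"), (28, 31, "SCHOOL")]),
    # A stray "/" is tagged as STUDIES (false positive).
    make_example("Master in Physics / 2015",
                 [(10, 17, "STUDIES"), (18, 19, "STUDIES")],
                 [(10, 17, "STUDIES")]),
    # Prediction matches the gold annotation exactly.
    make_example("Python developer",
                 [(0, 6, "SKILL")],
                 [(0, 6, "SKILL")]),
]

# Corpus-level scores, including a per-label breakdown.
scores = Scorer.score_spans(examples, "ents")
for label, prf in sorted(scores["ents_per_type"].items()):
    print(f"{label:8} P={prf['p']:.2f} R={prf['r']:.2f} F={prf['f']:.2f}")

# Score one example at a time and keep those that are not perfect.
imperfect = []
for example in examples:
    one = Scorer.score_spans([example], "ents")
    if one["ents_p"] < 1.0 or one["ents_r"] < 1.0:
        imperfect.append(example)
        print(example.reference.text,
              "| predicted:", [(e.text, e.label_) for e in example.predicted.ents],
              "| gold:", [(e.text, e.label_) for e in example.reference.ents])
```

The per-label breakdown in `ents_per_type` shows which entity types drag the overall F-score down, while the per-example loop surfaces concrete cases to inspect by hand.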

You could also write your own custom evaluation by comparing the predicted entities to the gold-standard ones - there's some related example code from the Prodigy forum here: https://support.prodi.gy/t/show-false-negative-false-positives-in-ner/3223/5
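As a rough illustration of such a custom comparison (this is not the code from the linked Prodigy thread, just a sketch under stated assumptions), entities can be treated as sets of `(start_char, end_char, label)` tuples and diffed. The function name `compare_entities`, the sample text, and the `DATE` label are hypothetical.

```python
def compare_entities(predicted, gold):
    """Split entity spans into hits, false positives and false negatives.

    Each span is a (start_char, end_char, label) tuple; a prediction only
    counts as a hit if offsets and label both match exactly.
    """
    pred_set, gold_set = set(predicted), set(gold)
    return {
        "hits": sorted(pred_set & gold_set),
        "false_pos": sorted(pred_set - gold_set),
        "false_neg": sorted(gold_set - pred_set),
    }


text = "Master in Physics / 2015"
# Hand-written stand-ins; with spaCy these could be built as
# [(e.start_char, e.end_char, e.label_) for e in doc.ents].
predicted = [(10, 17, "STUDIES"), (18, 19, "STUDIES")]
gold = [(10, 17, "STUDIES"), (20, 24, "DATE")]

result = compare_entities(predicted, gold)
for kind in ("false_pos", "false_neg"):
    for start, end, label in result[kind]:
        print(f"{kind}: {label} {text[start:end]!r}")
# prints:
# false_pos: STUDIES '/'
# false_neg: DATE '2015'
```

Printing the surface text of each false positive makes patterns such as a label firing on stray punctuation easy to spot across a test set.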

> I am trying to train a custom NER recognizer that would work on Resumes (with +/- 15 new entity tags). However, I am getting a relatively low F-Score (0.63).

Predicting named entities from resumes might generally be a bit challenging, as you often don't have a lot of context or syntax/grammar in the text. To be honest, I have no idea what the expected upper bound for this type of prediction would be...

1 reply

tristancatteeuw Mar 16, 2021
Author

Hello Sofie,

Thank you for your answer! I experimented a little more, and it seems that the low performance is at least partly due to the model being trained on CVs in multiple languages (English and French). If, instead of using the multi-language model, I train the "en_core_web_trf" model with only the English resumes, I get a pretty decent jump in performance, which is encouraging. I just have to determine now whether this is due to my French data being of poor quality (or insufficient quantity) or to something else.
As you said, I expect the model not to work perfectly because resumes are low-context text, but I saw reports of people doing this with much better accuracy than 0.63, which is why I was surprised at the low number.

mandar-avhad · 2021-03-27T06:16:29Z

Hello, I'm also working on a similar NER project. Can you share the steps for training NER on custom data, and the code if possible?

Also, can you share how to compute the accuracy of a trained NER model in spaCy 3? Please share any source or code related to that.

Thank you

0 replies
