
Failure to train the openai-community/gpt2 model for custom NER in the spaCy framework #13334

Closed
Project-NSE opened this issue Feb 16, 2024 · 1 comment
Labels
feat/llm Feature: LLMs (incl. spacy-llm)

Comments

@Project-NSE

I want to train custom NER tags using GPT-2, specifically the openai-community/gpt2 model. I am familiar with the spaCy custom training framework and formatted the config.cfg file as required. The config.cfg file is as follows:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}

[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
scorer = {"@scorers":"spacy.ner_scorer.v1"}
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "openai-community/gpt2"
mixed_precision = false

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.grad_scaler_config]

[components.transformer.model.tokenizer_config]
use_fast = true

[components.transformer.model.transformer_config]

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
annotating_components = []
before_to_disk = null
before_update = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null

[pretraining]

[initialize]
vectors = ${paths.vectors}
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
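I launch training with the standard spaCy CLI, along these lines (the corpus paths below are placeholders):

python -m spacy train config.cfg --output ./output \
    --paths.train ./train.spacy --paths.dev ./dev.spacy --gpu-id 0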

When I launched training, I received the error:

"ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})."

I posted this issue to the OpenAI forum and learned from @younesbelkada that I need to set tokenizer.pad_token = tokenizer.eos_token before launching training.
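In plain Hugging Face Transformers code, I understand that advice to mean something like this minimal sketch (GPT-2 ships without a pad token, so its end-of-text token is reused):

from transformers import AutoTokenizer

# GPT-2 defines no pad token by default; reuse its EOS token ("<|endoftext|>") for padding
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
tokenizer.pad_token = tokenizer.eos_token

The question is where to express this inside the spaCy config rather than in standalone Python.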

I modified the config.cfg file accordingly. The changed portion is as follows:

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
vectors = {"@vectors":"spacy.Vectors.v1"}
tokenizer.pad_token = tokenizer.eos_token

Now I received the following error:

✘ Config validation error
nlp -> tokenizer.pad_token   extra fields not permitted
{'lang': 'en', 'pipeline': ['transformer', 'ner'], 'batch_size': 128, 'disabled': [], 'before_creation': None, 'after_creation': None, 'after_pipeline_creation': None, 'tokenizer': {'@tokenizers': 'spacy.Tokenizer.v1'}, 'vectors': {'@vectors': 'spacy.Vectors.v1'}, 'tokenizer.pad_token': 'tokenizer.eos_token'}
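From the spacy-transformers docs I understand that entries in the transformer component's tokenizer_config block are passed on to the Hugging Face tokenizer, so perhaps the pad token belongs there instead. A sketch, assuming these keys are forwarded to AutoTokenizer.from_pretrained:

[components.transformer.model.tokenizer_config]
use_fast = true
# assumption: forwarded to the Hugging Face tokenizer, registering
# GPT-2's end-of-text token as the pad token
pad_token = "<|endoftext|>"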

Can you please let me know how I can fix this issue and train the GPT-2 model?

Thank you in advance.

@svlandeg svlandeg added the feat/llm Feature: LLMs (incl. spacy-llm) label Feb 22, 2024
@svlandeg (Member)

Hi! Let me transfer this question to the discussion forum and we'll follow up with you there.

@explosion explosion locked and limited conversation to collaborators Feb 22, 2024
@svlandeg svlandeg converted this issue into discussion #13349 Feb 22, 2024

