
Unable to run NER in inference mode #54

Closed
soumyavhasure opened this issue Oct 30, 2019 · 10 comments

Comments

@soumyavhasure

I trained the BioBERT model using the v1.1 weights found in the NAVER Github repository. I followed the instructions provided, but had to modify "--init_checkpoint=$BIOBERT_DIR/biobert_model.ckpt" part of the code to "--init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000." Once the model had been trained, I tried running it in inference mode with "--do_train=false --do_eval=true --do_predict=true" and everything else the same as when I trained the model. I am able to see the token-level evaluation result printed as stdout format. However, this does not create the "token_test.txt and label_test.txt in output_dir." I'm not sure if I'm doing something wrong. Any help would be appreciated!

@wonjininfo
Member

Hi, please check this comment from issue #39.
Thanks!
Wonjin

@yrahul3910

yrahul3910 commented Nov 3, 2019

I can confirm this issue. Trying the workaround with --num_train_epochs=0.1 doesn't seem to work. The issue is with the two files token_test.txt and label_test.txt not existing. Is there something we're missing? What generates these files?

Edit: I dug around a bit. Running grep token_test.txt * in the source directory shows that the file is only referenced in run_ner.py, where the path is called token_path. On line 586 the file is removed if it exists, but on line 610 it is read, and none of the lines in between ever write to it (so there is no way for the file to be created).
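To make the failure mode concrete, here is a hypothetical reconstruction of the pattern described above. The names (token_path, output_dir) follow run_ner.py, but this is a sketch, not the actual BioBERT code:

```python
import os
import tempfile

# Stand-in for FLAGS.output_dir, just so the sketch is runnable:
output_dir = tempfile.mkdtemp()
token_path = os.path.join(output_dir, "token_test.txt")

# ~line 586: remove token_test.txt if it exists
if os.path.exists(token_path):
    os.remove(token_path)

# ...no code in between writes to token_path...

# ~line 610: try to read it back. In a predict-only run nothing has
# created the file, so this fails unless a prior fine-tuning run
# (with the same output_dir) left one behind.
try:
    with open(token_path) as reader:
        tokens = reader.read().splitlines()
except FileNotFoundError:
    print("token_test.txt was never created")
```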

@wonjininfo
Member

wonjininfo commented Nov 3, 2019

Hi all,
Would you please check this recent comment?
In short, you need to fine-tune first, and use that "pretrained" weight in inference mode for your next experiments. (token_test.txt will be generated while you fine-tune our weights.)

@Anushka1610

I experienced similar behavior because my test file didn't have a blank newline at the end. DataProcessor._read_data() expects one, and will return an empty list of token/label pairs if it processes the entire file without finding one. Adding the newline fixed this issue for me.
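To illustrate the failure mode with a minimal sketch (a simplification of my own, not the actual DataProcessor._read_data() code): if the reader only flushes a sentence when it hits a blank line, then a file without a trailing newline silently drops its last sentence.

```python
def read_conll(lines):
    """Group (token, label) pairs into sentences, flushing on blank lines.

    Mirrors the failure mode: the last sentence is emitted only when a
    blank line follows it, so a file without a trailing blank line
    loses that sentence entirely.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:  # blank line ends a sentence
            if current:
                sentences.append(current)
            current = []
        else:
            token, label = line.split("\t")
            current.append((token, label))
    return sentences  # note: `current` is never flushed at EOF

# Without a final blank line, the only sentence disappears:
print(read_conll(["word\tO\n"]))        # []
print(read_conll(["word\tO\n", "\n"]))  # [[('word', 'O')]]
```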

@yrahul3910

@Anushka1610 Interesting. Could you describe the structure of your test file? Intuitively, I assume it should only hold one word or one sentence per line. Is this correct?

@Anushka1610

Anushka1610 commented Nov 15, 2019

@yrahul3910 The structure of your .tsv should mimic that of the ones provided. You're correct that it should be one word/label pair per line, separated by a tab:

word\tlabel\n

For example, let's cat NERdata/NCBI-disease/train.tsv | head -10:

Identification  O
of      O
APC2    O
,       O
a       O
homologue       O
of      O
the     O
adenomatous     B
polyposis       I

@yrahul3910

@Anushka1610 Thanks for clarifying! How do you run prediction then? Given a sequence of words, I'd like to predict their class; so it feels odd to me that I need to provide classes for the test files. Is there a way around this? Alternatively, do you think I could simply give them all a random class and just look at the predictions generated by the model?

@yrahul3910

I still can't get this to work. In my NER_DIR, I have train.tsv, train_dev.tsv, devel.tsv, and test.tsv. The first three are copied as-is from the s800 data set provided in the repo; the last one I formatted to match the format described above. If it makes any difference, in my test.tsv I labeled everything as class B (since I do not know the labels and prediction is my goal).
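One workaround for the missing-labels problem (my own assumption, not an official recipe) is to give every token the placeholder label O rather than B, since O ("outside") is always a valid tag and doesn't falsely mark entity starts, and then simply ignore the gold column in the output. A sketch of a hypothetical helper, write_placeholder_tsv, that builds such a file:

```python
# Hypothetical helper: build a test.tsv where every token gets the
# placeholder label "O", since only the model's predictions matter.
def write_placeholder_tsv(sentences, path):
    with open(path, "w") as f:
        for sentence in sentences:
            for token in sentence.split():
                f.write(token + "\tO\n")
            f.write("\n")  # blank line after each sentence (incl. at EOF)

write_placeholder_tsv(["APC2 is a homologue"], "test.tsv")
```

The trailing blank line after the last sentence also satisfies the end-of-file expectation mentioned earlier in this thread.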

@demongolem

demongolem commented Dec 23, 2019

Hi all,
Would you please check this recent comment?
In short, you need to fine-tune first, and use that "pretrained" weight in inference mode for your next experiments. (token_test.txt will be generated while you fine-tune our weights.)

I would like to add a question to this, though. Line 586 says if FLAGS.do_predict, so it seems that during prediction token_test.txt is deleted. I agree that token_test.txt is created during the fine-tuning process, and I have created that file, but why would it be deleted during prediction? I cannot see anywhere, before the fatal error on line 615 where we try to read token_test.txt, that the file would be created again. If I comment out the remove portion, I get the same token_test.txt with the timestamp of when the prediction completed. Must a new token_test.txt be generated during this prediction phase? If I comment it out as indicated, the detokenizer in biocodes will not work because of the unequal lengths of the label and token files.

@Mayar2009

As I understand it, in short, for training:
python run_ner.py \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$BIOBERT_DIR/biobert_model.ckpt \
    --data_dir=$NER_DIR/ \
    --do_train=true \
    --do_eval=true \
    --num_train_epochs=10.0 \
    --output_dir=/tmp/bioner/(pre-trained dir)

and for predicting:

python run_ner.py \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$BIOBERT_DIR/biobert_model.ckpt \
    --data_dir=$NER_DIR/ \
    --do_train=true \
    --do_predict=true \
    --num_train_epochs= 4.0 \
    --output_dir=/tmp/bioner/(pre-trained dir)

So the only change at the predicting stage is writing ( --do_predict=true ) instead of ( --do_eval=true ) and using a smaller number for num_train_epochs, right?

For me, I have:
NER_DIR = "/content/drive/My Drive/Colab Notebooks/Bert/BioBert/NERdata/BC5CDR-chem"
BIOBERT_DIR = "/content/drive/My Drive/Colab Notebooks/Bert/BioBert/biobert_v1.1_pubmed/biobert_v1.1_pubmed"
Output_Dir = "/content/drive/My Drive/Colab Notebooks/Bert/BioBert/output/"

So when I fine-tuned and then tried to predict as written above, I got this error:

WARNING:tensorflow:From /content/drive/My Drive/Colab Notebooks/Bert/BioBert/biobert-master/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

/usr/local/lib/python3.6/dist-packages/absl/flags/_validators.py:359: UserWarning: Flag --task_name has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
'command line!' % flag_name)
WARNING:tensorflow:From /content/drive/My Drive/Colab Notebooks/Bert/BioBert/biobert-master/run_ner.py:646: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_flag.py", line 181, in _parse
return self.parser.parse(argument)
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_argument_parser.py", line 152, in parse
val = self.convert(argument)
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_argument_parser.py", line 213, in convert
return float(argument)
ValueError: could not convert string to float:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/content/drive/My Drive/Colab Notebooks/Bert/BioBert/biobert-master/run_ner.py", line 646, in <module>
tf.app.run()
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 293, in run
flags_parser,
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 362, in _run_init
flags_parser=flags_parser,
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 212, in _register_and_parse_flags_with_usage
args_to_main = flags_parser(original_argv)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/app.py", line 31, in _parse_flags_tolerate_undef
return flags.FLAGS(_sys.argv if argv is None else argv, known_only=True)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/platform/flags.py", line 112, in __call__
return self.__dict__['__wrapped'].__call__(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py", line 626, in __call__
unknown_flags, unparsed_args = self._parse_args(args, known_only)
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_flagvalues.py", line 774, in _parse_args
flag.parse(value)
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_flag.py", line 166, in parse
self.value = self._parse(argument)
File "/usr/local/lib/python3.6/dist-packages/absl/flags/_flag.py", line 184, in _parse
'flag --%s=%s: %s' % (self.name, argument, e))
absl.flags._exceptions.IllegalFlagValueError: flag --num_train_epochs=: could not convert string to float:

What should I do in this case?
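For what it's worth, my reading of the traceback above (an interpretation, not a confirmed fix): absl ultimately calls float() on the flag's raw value, and the space in --num_train_epochs= 4.0 makes the shell pass "--num_train_epochs=" and "4.0" as separate arguments, so the flag receives an empty string.

```python
# absl's float flag parser ends up calling float() on the raw value.
# With "--num_train_epochs= 4.0" the flag's value is "", reproducing
# the error in the traceback:
try:
    float("")
except ValueError as e:
    print(e)  # same error message as in the traceback

# Removing the space gives the parser a proper value:
print(float("4.0"))  # 4.0
```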

wonjininfo added a commit that referenced this issue Apr 10, 2020
Fix known bugs about NER task.
1. NER inference mode (only predict) #39 #50 #54
2. Check OUTPUT dir and make it if not exists
3. Fixed "missing labels" problem.
3-1. See Line 631 of run_ner.py for [PAD] related problem
3-2. See biocodes/ner_detokenize.py for max_seq_length related problems
4. Refactored a few lines (ex. os.path.join, replaced **NULL** with [PAD])
5. Functionize detokenizer (See biocodes/ner_detokenize.py for details)
6. misc

If you wish to use the previous version, please use tag v20200409: https://github.com/dmis-lab/biobert/tree/v20200409