Issue in running ner_detokenize code #107
Hi,
When I tried it on the NCBI-disease dataset it worked fine; now I am using the Kaggle COVID-19 dataset of research papers: https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge/tasks?taskId=570.
It must be related to the pre-processing step. In my experience, some Unicode space characters can cause this problem in the tokenization step. I used complex code (modified from here) to do this step, but I think the `normalize` function from the `unicodedata` package is a good option: `from unicodedata import normalize`. Normalizing all of the original text should help. Also, you need to check 1) the maximum length of a word and 2) the handling of special characters such as '(', ')', '-', '_', etc. The code is from my colleagues and needs to be refactored a bit more. Thanks :)
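For instance, a minimal sketch (NFKC is one reasonable choice of normalization form; the sample string is only an illustration):

```python
from unicodedata import normalize

def clean_text(text):
    # NFKC folds compatibility characters: a no-break space (U+00A0)
    # becomes a plain space, a thin space (U+2009) becomes a plain
    # space, full-width characters become ASCII, and so on.
    return normalize('NFKC', text)

raw = "Severe\u00a0acute respiratory\u2009syndrome"
print(clean_text(raw))  # all spaces are now plain U+0020
```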
These special characters appear in the train.tsv of each dataset, so how can they be causing an issue?
Sorry, I meant the spacing near special characters. We added a space before and after some special characters.
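For example, a hypothetical sketch of that kind of padding (the exact character set used in our pre-processing is not shown here; '()-_' is only an illustration):

```python
import re

SPECIALS = re.compile(r"([()\-_])")  # illustrative character set

def pad_specials(text):
    # Surround each special character with spaces, then collapse
    # repeated whitespace back to single spaces.
    return re.sub(r"\s+", " ", SPECIALS.sub(r" \1 ", text)).strip()

print(pad_specials("COVID-19 (SARS-CoV-2)"))
# -> "COVID - 19 ( SARS - CoV - 2 )"
```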
I used word_tokenize from nltk.tokenize to make the test.tsv file for the Kaggle tokens. I understand what you explained. Can you direct me to the tokenization code that is compatible with this model?
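For reference, this is roughly what I did (a sketch; the one-token-per-line test.tsv layout with a placeholder 'O' label is my assumption about the expected format):

```python
from nltk.tokenize import word_tokenize

sentences = ["COVID-19 is caused by SARS-CoV-2."]  # raw text from the Kaggle dataset

with open("test.tsv", "w") as f:
    for sent in sentences:
        for tok in word_tokenize(sent):
            f.write(tok + "\tO\n")  # one token per line with a placeholder label
        f.write("\n")  # blank line between sentences
```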
Hi, the pre-processing of the datasets was mostly done by other co-authors. The following is an example usage of the code:

```python
from ops import json_to_sent, input_form

data = [{
    "pmid": "123",
    # "title": "I want coffee",  # not necessary
    "abstract": "This is a dummy data to learn how these codes (ner.ops - json_to_sent and input form) are working. Thanks."
}]

sentData = json_to_sent(data, is_raw_text=True)  # set is_raw_text=True if you do not use "title"
sentData
```

This will split the input sequence into multiple sentences. Use `input_form` to tokenize the sentences:

```python
MAX_CHARS_WORD = 22

for key, values in input_form(sentData, max_input_chars_per_word=MAX_CHARS_WORD)["123"].items():
    print(key + ": ", values)
```
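For the dummy abstract above, the printed output should look roughly like the following (positions abbreviated):

```
sentence:  ['This is a dummy data to learn how these codes (ner.ops - json_to_sent and input form) are working.', ' Thanks.']
words:  [['This', 'is', 'a', 'dummy', 'data', ...], ['Thanks', '.']]
wordPos:  [[(0, 3), (5, 6), ...], [...]]
```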
The code (ops.py):

```python
#
# Original code from https://github.com/dmis-lab/bern/blob/master/biobert_ner/ops.py
# Modified by Wonjin Yoon (wonjin.info) for BioBERT SeqTag task
#
import numpy as np
import re

# Splits text into runs of alphanumeric characters and single non-alphanumeric characters.
tokenize_regex = re.compile(r'([0-9a-zA-Z]+|[^0-9a-zA-Z])')

def json_to_sent(data, is_raw_text=False):
    '''data: list of json file [{pmid,abstract,title}, ...] '''
    out = dict()
    for paper in data:
        sentences = list()
        if is_raw_text:
            # assure that paper['abstract'] is not empty
            abst = sentence_split(paper['abstract'])
            if len(abst) != 1 or len(abst[0].strip()) > 0:
                sentences.extend(abst)
        else:
            # assure that paper['title'] is not empty
            if len(CoNLL_tokenizer(paper['title'])) < 50:
                title = [paper['title']]
            else:
                title = sentence_split(paper['title'])
            if len(title) != 1 or len(title[0].strip()) > 0:
                sentences.extend(title)

            if len(paper['abstract']) > 0:
                abst = sentence_split(' ' + paper['abstract'])
                if len(abst) != 1 or len(abst[0].strip()) > 0:
                    sentences.extend(abst)

        out[paper['pmid']] = dict()
        out[paper['pmid']]['sentence'] = sentences
    return out

def input_form(sent_data, max_input_chars_per_word=20):
    '''sent_data: dict of sentence, key=pmid {pmid:[sent,sent, ...], pmid: ...}'''
    for pmid in sent_data:
        sent_data[pmid]['words'] = list()
        sent_data[pmid]['wordPos'] = list()
        doc_piv = 0
        for sent in sent_data[pmid]['sentence']:
            wids = list()
            wpos = list()
            sent_piv = 0
            tok = CoNLL_tokenizer(sent)

            for w in tok:
                if len(w) > max_input_chars_per_word:  # was 20
                    wids.append(w[:max_input_chars_per_word])  # was 10
                else:
                    wids.append(w)

                start = doc_piv + sent_piv + sent[sent_piv:].find(w)
                end = start + len(w) - 1
                sent_piv = end - doc_piv + 1
                wpos.append((start, end))
            doc_piv += len(sent)
            sent_data[pmid]['words'].append(wids)
            sent_data[pmid]['wordPos'].append(wpos)
    return sent_data

def isInt(string):
    try:
        int(string)
        return True
    except ValueError:
        return False


def isFloat(string):
    try:
        float(string)
        return True
    except ValueError:
        return False

def softmax(logits):
    # For each logit vector, compute a numerically stable softmax
    # and keep only the maximum probability.
    out = list()
    for logit in logits:
        temp = np.subtract(logit, np.max(logit))
        p = np.exp(temp) / np.sum(np.exp(temp))
        out.append(np.max(p))
    return out

def CoNLL_tokenizer(text):
    rawTok = [t for t in tokenize_regex.split(text) if t]
    assert ''.join(rawTok) == text
    tok = [t for t in rawTok if t != ' ']
    return tok

def sentence_split(text):
    sentences = list()
    sent = ''
    piv = 0
    for idx, char in enumerate(text):
        if char in "?!":
            if idx > len(text) - 3:
                sent = text[piv:]
                piv = -1
            else:
                sent = text[piv:idx + 1]
                piv = idx + 1
        elif char == '.':
            if idx > len(text) - 3:
                sent = text[piv:]
                piv = -1
            elif (text[idx + 1] == ' ') and (
                    text[idx + 2] in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ-"' + "'"):
                sent = text[piv:idx + 1]
                piv = idx + 1

        if sent != '':
            toks = CoNLL_tokenizer(sent)
            if len(toks) > 100:
                # Split overly long sentences into chunks of 200 raw
                # tokens (including whitespace tokens).
                while True:
                    rawTok = [t for t in tokenize_regex.split(sent) if t]
                    cut = ''.join(rawTok[:200])
                    sent = ''.join(rawTok[200:])
                    sentences.append(cut)

                    if len(CoNLL_tokenizer(sent)) < 100:
                        if sent.strip() == '':
                            sent = ''
                            break
                        else:
                            sentences.append(sent)
                            sent = ''
                            break
            else:
                sentences.append(sent)
                sent = ''

            if piv == -1:
                break

    # Flush any trailing text after the last sentence boundary.
    if piv != -1:
        sent = text[piv:]
        toks = CoNLL_tokenizer(sent)
        if len(toks) > 100:
            while True:
                rawTok = [t for t in tokenize_regex.split(sent) if t]
                cut = ''.join(rawTok[:200])
                sent = ''.join(rawTok[200:])
                sentences.append(cut)

                if len(CoNLL_tokenizer(sent)) < 100:
                    if sent.strip() == '':
                        sent = ''
                        break
                    else:
                        sentences.append(sent)
                        sent = ''
                        break
        else:
            sentences.append(sent)
            sent = ''

    return sentences
```
I am trying to run ner_detokenize.py and am getting this error. Can anyone help me with this?
```
437923 485173
Error! : len(ans['labels']) != len(bert_pred['labels']) : Please report us
Traceback (most recent call last):
  File "biocodes/ner_detokenize.py", line 88, in <module>
    detokenize(args.answer_path, args.token_test_path, args.label_test_path, args.output_dir)
  File "biocodes/ner_detokenize.py", line 77, in detokenize
    raise
RuntimeError: No active exception to reraise
```
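The two numbers printed before the error (437923 and 485173) look like the mismatched label counts: the gold answer file and the model's predictions do not contain the same number of tokens, which typically happens when the test set was tokenized differently from what the model expects, or when long sequences were truncated. A quick sanity check (a hypothetical sketch; the file paths and the one-token-per-line format are assumptions):

```python
def count_tokens(path):
    # Count non-blank lines; in CoNLL-style NER files each non-blank
    # line carries one token (or one predicted label).
    with open(path) as f:
        return sum(1 for line in f if line.strip())

print(count_tokens("NERdata/test.tsv"))       # gold tokens
print(count_tokens("output/label_test.txt"))  # predicted labels
```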