How to improve named entity normalization for human proteins? #2

Closed
dhimmel opened this issue Jan 14, 2022 · 5 comments

@dhimmel

dhimmel commented Jan 14, 2022

Very excited to see BERN2! Really nice work so far.

I'm looking to map certain mentions of proteins to standard identifiers. Here's a list of these proteins, where each protein is also followed by a direction of activity:

3 beta hydroxysteroid dehydrogenase 5 stimulator; AF4/FMR2 protein 2 inhibitor; Adenylate cyclase 2 stimulator; Alpha gamma adaptin binding protein p34 stimulator; BR serine threonine protein kinase 1 stimulator; Complement Factor B stimulator; DNA gyrase B inhibitor; Ectonucleotide pyrophosphatase-PDE-3 stimulator; Falcipain 1 stimulator; Homeobox protein Nkx 2.4 stimulator; ISLR protein inhibitor; Integrin alpha-IIb/beta-4 antagonist; Inter alpha trypsin inhibitor H5 stimulator; Interleukin receptor 17B antagonist; Isopropylmalate dehydrogenase stimulator; Methylthioadenosine nucleosidase stimulator; Patched domain containing protein 2 inhibitor; Protein FAM161A stimulator; Protocadherin gamma A1 inhibitor; Ring finger protein 4 stimulator; SMAD-9 inhibitor; Small ubiquitin related modifier 1 inhibitor; Sodium-dicarboxylate cotransporter-1 inhibitor; Sorting nexin 9 inhibitor; Sugar phosphate exchanger 2 stimulator; Transcription factor p65 stimulator; Tumor necrosis factor 14 ligand inhibitor; Ubiquitin-conjugating enzyme E21 stimulator; Unspecified ion channel inhibitor; Zinc finger BED domain protein 6 inhibitor
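Since each entry is a protein name followed by a direction of activity, the two parts can be separated before normalization. A minimal sketch, assuming the direction is always the final whitespace-separated token (true for the entries above, but an assumption in general); the function names are hypothetical:

```python
from typing import List, Tuple


def split_mention(entry: str) -> Tuple[str, str]:
    """Split 'Protein name direction' into (protein, direction).

    Only the last token is treated as the direction, so names that
    themselves contain 'inhibitor' are left intact.
    """
    protein, _, direction = entry.strip().rpartition(" ")
    return protein, direction


def parse_list(raw: str) -> List[Tuple[str, str]]:
    """Parse a semicolon-separated list of entries, skipping empty items."""
    return [split_mention(item) for item in raw.split(";") if item.strip()]


raw = (
    "SMAD-9 inhibitor; "
    "Inter alpha trypsin inhibitor H5 stimulator; "
    "Integrin alpha-IIb/beta-4 antagonist"
)
pairs = parse_list(raw)
for protein, direction in pairs:
    print(f"{protein!r} -> {direction!r}")
```

Splitting on the last token only matters for entries such as "Inter alpha trypsin inhibitor H5 stimulator", where "inhibitor" is part of the protein name.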

Using the nice web interface, I get:

[Screenshot: BERN2 web interface output for the list above]

So overall BERN2 does a good job recognizing the protein mentions. However, we actually already know what the protein text is, and are more interested in normalization. Most of the gene/protein mentions receive "ID: CUI-less". Any advice on how to improve the performance of named entity normalization for human proteins?

I see that the website notes that normalization is done by https://github.com/dmis-lab/BioSyn, so feel free to migrate this issue to that repo if it's best there.

@mjeensung
Contributor

Hi @dhimmel,
Thank you for your interest in BERN2.

BioSyn, the neural network normalizer, currently only supports disease and chemical types. Please note that we place an asterisk next to a CUI that has been normalized by 'BioSyn' (e.g., ID: MESH:D013217*).
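For anyone post-processing the displayed IDs, the asterisk convention can be handled as below. This is a sketch that only assumes the two display forms mentioned in this thread ("ID: MESH:D013217*" and "ID: CUI-less"); `ParsedId` and `parse_displayed_id` are hypothetical names, not part of BERN2:

```python
from typing import NamedTuple, Optional


class ParsedId(NamedTuple):
    identifier: Optional[str]  # None when the mention is CUI-less
    neural: bool               # True when the ID carries a trailing asterisk


def parse_displayed_id(text: str) -> ParsedId:
    """Parse a displayed ID string such as 'ID: MESH:D013217*'."""
    value = text.strip()
    if value.startswith("ID:"):
        value = value[len("ID:"):].strip()
    neural = value.endswith("*")
    if neural:
        value = value[:-1]
    if value.lower() == "cui-less":
        return ParsedId(None, False)
    return ParsedId(value, neural)


print(parse_displayed_id("ID: MESH:D013217*"))
print(parse_displayed_id("ID: CUI-less"))
```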

For the gene/protein type, we are using an off-the-shelf gene type normalizer GNormPlus and the human proteins in your examples are the entities that GNormPlus could not normalize.

If a better gene/protein normalizer is released in the future, we plan to replace the current one with it.

@dhimmel
Author

dhimmel commented Jan 19, 2022

Thanks @mjeensung for the clarification. Feel free to post any leads on better gene/protein normalizers here... I'm happy to help evaluate.

Looking at the GNormPlus docs, it does "mention recognition and concept normalization". So are you able to just apply GNormPlus at the concept normalization stage for genes, while using the mention recognition from BERN2? I think the code I'm asking about is:

BERN2/bern2/normalizer.py

Lines 307 to 401 in 20cef6b

# call GNormPlus
elif ent_type == 'gene':
    # create socket
    # s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s = socket.socket()
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    try:
        s.connect((self.HOST, self.GENE_PORT))
    except ConnectionRefusedError as cre:
        print('Check GNormPlus jar', cre)
        s.close()
        return oids
    # 1. Write as input files to normalizers
    norm_inp_path = os.path.join(self.NORM_INPUT_DIR[ent_type],
                                 input_filename)
    norm_abs_path = os.path.join(self.NORM_INPUT_DIR[ent_type],
                                 base_thread_name + '.txt')
    space_type = ' ' + ent_type
    with open(norm_inp_path, 'w') as norm_inp_f:
        with open(norm_abs_path, 'w') as norm_abs_f:
            for saved_item in saved_items:
                entities = saved_item['entities'][ent_type]
                if len(entities) == 0:
                    continue
                abstract_title = saved_item['abstract']
                ent_names = list()
                for loc in entities:
                    e_name = abstract_title[loc['start']:loc['end']]
                    if len(e_name) > len(space_type) \
                            and space_type \
                            in e_name.lower()[-len(space_type):]:
                        # print('Replace', e_name,
                        #       'w/', e_name[:-len(space_type)])
                        e_name = e_name[:-len(space_type)]
                    ent_names.append(e_name)
                norm_abs_f.write(saved_item['pmid'] + '||' +
                                 abstract_title + '\n')
                norm_inp_f.write('||'.join(ent_names) + '\n')
    # 2. Run normalizers
    gene_input_dir = os.path.abspath(
        os.path.join(self.NORM_INPUT_DIR[ent_type]))
    gene_output_dir = os.path.abspath(
        os.path.join(self.NORM_OUTPUT_DIR[ent_type]))
    setup_dir = self.NORM_DICT_PATH[ent_type]  # setup.txt
    # start jar
    jar_args = '\t'.join(
        [gene_input_dir, gene_output_dir, setup_dir, '9606',  # human
         base_thread_name]) + '\n'
    s.send(jar_args.encode('utf-8'))
    # input_stream = struct.pack('>H', len(jar_args)) + jar_args.encode('utf-8')
    # s.send(input_stream)
    s.recv(bufsize)
    s.close()
    # 3. Read output files of normalizers
    norm_out_path = os.path.join(gene_output_dir, output_filename)
    if os.path.exists(norm_out_path):
        with open(norm_out_path, 'r') as norm_out_f, \
                open(norm_inp_path, 'r') as norm_in_f:
            for line, input_l in zip(norm_out_f, norm_in_f):
                gene_ids, gene_mentions = line[:-1].split('||'), \
                                          input_l[:-1].split('||')
                for gene_id, gene_mention in zip(gene_ids,
                                                 gene_mentions):
                    eid = None
                    if gene_id.lower() == 'cui-less':
                        eid = self.NO_ENTITY_ID
                    else:
                        bar_idx = gene_id.find('-')
                        if bar_idx > -1:
                            gene_id = gene_id[:bar_idx]
                        eid = gene_id
                        eid = "EntrezGene:" + eid
                    oids.append(eid)
        # 5. Remove output files
        os.remove(norm_out_path)
    else:
        print('Not found!!!', norm_out_path)
        # Sad error handling
        for _ in range(len(name_ptr)):
            oids.append(self.NO_ENTITY_ID)
    # 4. Remove input files
    os.remove(norm_inp_path)
    os.remove(norm_abs_path)
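
To make the string handling in that snippet easier to follow, here is a sketch that isolates it into two pure functions. The function names are hypothetical, the value of `NO_ENTITY_ID` is assumed to be "CUI-less" (the snippet does not show it), and the example ID "1234-5" is a made-up placeholder rather than real GNormPlus output:

```python
NO_ENTITY_ID = "CUI-less"  # assumption: the snippet only references self.NO_ENTITY_ID


def strip_type_suffix(name: str, ent_type: str = "gene") -> str:
    """Drop a trailing ' gene' from a mention, as in step 1 of the snippet."""
    suffix = " " + ent_type
    if len(name) > len(suffix) and name.lower().endswith(suffix):
        return name[:-len(suffix)]
    return name


def to_entity_id(raw_id: str) -> str:
    """Convert one normalizer output field to an ID, as in step 3 of the snippet."""
    if raw_id.lower() == "cui-less":
        return NO_ENTITY_ID
    # Keep only the part before the first '-', then add the namespace prefix.
    head = raw_id.split("-", 1)[0]
    return "EntrezGene:" + head


print(strip_type_suffix("BRCA1 gene"))  # trailing ' gene' removed
print(to_entity_id("1234-5"))           # placeholder ID, truncated at '-'
print(to_entity_id("CUI-less"))         # unnormalized mention
```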

@cthoyt
Contributor

cthoyt commented Jan 19, 2022

Here's our off the shelf gene (and other entity) normalizer that's ready for use: https://github.com/indralab/gilda
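
A minimal sketch of how Gilda might be applied to already-extracted mentions. It relies on `gilda.ground`, which returns scored matches exposing `term.db`, `term.id` and `score`; the wrapper `normalize_mentions` and its `ground` parameter are hypothetical, added so a stub can be substituted for testing without installing Gilda:

```python
from typing import Callable, Dict, Iterable, Optional


def normalize_mentions(
    mentions: Iterable[str],
    ground: Optional[Callable[[str], list]] = None,
) -> Dict[str, Optional[str]]:
    """Map each mention to the 'DB:ID' of its best-scoring match, or None."""
    if ground is None:
        # Imported lazily so the function can be tested with a stub grounder.
        import gilda  # pip install gilda
        ground = gilda.ground

    results: Dict[str, Optional[str]] = {}
    for mention in mentions:
        matches = ground(mention)
        if matches:
            best = max(matches, key=lambda m: m.score)
            results[mention] = f"{best.term.db}:{best.term.id}"
        else:
            results[mention] = None
    return results


if __name__ == "__main__":
    # Requires Gilda and its resources to be available locally.
    for mention, curie in normalize_mentions(
        ["Sorting nexin 9", "Transcription factor p65"]
    ).items():
        print(mention, "->", curie)
```

The direction word ("stimulator", "inhibitor", ...) would presumably need to be stripped from each entry before grounding.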

@mjeensung
Contributor

@dhimmel, that's correct.

For genes, mentions are recognized by the BERN2 NER model (better performance than GNormPlus) and normalized by GNormPlus.

@mjeensung
Contributor

Thank you for recommending this great tool, @cthoyt.
We will look into the tool, Gilda, and see if we can incorporate it into BERN2.
