How to improve named entity normalization for human proteins? #2

dhimmel · 2022-01-14T23:00:44Z

Very excited to see BERN2! Really nice work so far.

I'm looking to map certain mentions of proteins to standard identifiers. Here's a list of these proteins, where each protein is also followed by a direction of activity:

3 beta hydroxysteroid dehydrogenase 5 stimulator; AF4/FMR2 protein 2 inhibitor; Adenylate cyclase 2 stimulator; Alpha gamma adaptin binding protein p34 stimulator; BR serine threonine protein kinase 1 stimulator; Complement Factor B stimulator; DNA gyrase B inhibitor; Ectonucleotide pyrophosphatase-PDE-3 stimulator; Falcipain 1 stimulator; Homeobox protein Nkx 2.4 stimulator; ISLR protein inhibitor; Integrin alpha-IIb/beta-4 antagonist; Inter alpha trypsin inhibitor H5 stimulator; Interleukin receptor 17B antagonist; Isopropylmalate dehydrogenase stimulator; Methylthioadenosine nucleosidase stimulator; Patched domain containing protein 2 inhibitor; Protein FAM161A stimulator; Protocadherin gamma A1 inhibitor; Ring finger protein 4 stimulator; SMAD-9 inhibitor; Small ubiquitin related modifier 1 inhibitor; Sodium-dicarboxylate cotransporter-1 inhibitor; Sorting nexin 9 inhibitor; Sugar phosphate exchanger 2 stimulator; Transcription factor p65 stimulator; Tumor necrosis factor 14 ligand inhibitor; Ubiquitin-conjugating enzyme E21 stimulator; Unspecified ion channel inhibitor; Zinc finger BED domain protein 6 inhibitor
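
Since each entry above is a protein name followed by a single word for the direction of activity, the two parts can be separated before normalization. Here is a minimal sketch, assuming the direction is always the last whitespace-separated token; the names `Mention`, `split_mention`, and `parse_mentions` are made up for illustration and are not part of BERN2:

```python
from typing import List, NamedTuple


class Mention(NamedTuple):
    protein: str    # text to send to a normalizer
    direction: str  # e.g. "stimulator", "inhibitor", "antagonist"


def split_mention(text: str) -> Mention:
    """Split on the last space: everything before it is the protein name."""
    protein, _, direction = text.strip().rpartition(" ")
    return Mention(protein=protein, direction=direction)


def parse_mentions(raw: str) -> List[Mention]:
    """Parse a semicolon-separated string of mentions, skipping empty parts."""
    return [split_mention(part) for part in raw.split(";") if part.strip()]


raw = (
    "Adenylate cyclase 2 stimulator; DNA gyrase B inhibitor; "
    "Integrin alpha-IIb/beta-4 antagonist"
)
mentions = parse_mentions(raw)
for m in mentions:
    print(f"{m.protein!r} -> {m.direction}")
```

A single-word entry would come back with an empty protein name, so real input may need an explicit check for that case.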

Using the nice web interface, I get:

So overall BERN2 does a good job recognizing the protein mentions. However, we actually already know what the protein text is, and are more interested in normalization. Most of the gene/protein mentions receive "ID: CUI-less". Any advice on how to improve the performance of named entity normalization for human proteins?

I see that the website notes that normalization is done by https://github.com/dmis-lab/BioSyn, so feel free to migrate this issue to that repo if it's best there.

mjeensung · 2022-01-18T03:55:01Z

Hi @dhimmel,
Thank you for your interest in BERN2.

BioSyn, the neural network normalizer, currently only supports disease and chemical types. Please note that we place an asterisk next to a CUI that has been normalized by 'BioSyn' (e.g., ID: MESH:D013217*).

For the gene/protein type, we use an off-the-shelf gene normalizer, GNormPlus. The human proteins in your examples are entities that GNormPlus could not normalize.

If a better gene/protein type normalizer is released in the future, we plan to replace the current gene/protein type normalizer with it.
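
The ID conventions described in this comment (a trailing asterisk for IDs normalized by BioSyn, and "CUI-less" for mentions that could not be normalized) can be checked with a few lines of code. This is only a sketch that works on ID strings of that form; the function name `classify_id` and the status labels are made up for illustration, and how the IDs are pulled out of BERN2's output is not shown here:

```python
from collections import Counter
from typing import Optional, Tuple


def classify_id(raw_id: str) -> Tuple[str, Optional[str]]:
    """Return (status, cleaned_id) for one ID string."""
    cleaned = raw_id.strip()
    if cleaned.lower() == "cui-less":
        # The normalizer found no identifier for this mention.
        return "unnormalized", None
    if cleaned.endswith("*"):
        # Trailing asterisk: normalized by BioSyn; drop the marker.
        return "biosyn", cleaned[:-1]
    # No asterisk: normalized by some other component.
    return "other", cleaned


ids = ["MESH:D013217*", "CUI-less", "MESH:D013217"]
counts = Counter(classify_id(i)[0] for i in ids)
print(counts)
```

Counting the statuses over a batch gives a quick measure of how many mentions were left without an identifier.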

dhimmel · 2022-01-19T00:26:45Z

Thanks @mjeensung for the clarification. Feel free to post any leads on better gene/protein normalizers here... I'm happy to help evaluate.

Looking at the GNormPlus docs, it does "mention recognition and concept normalization". So are you able to just apply GNormPlus at the concept normalization stage for genes, while using the mention recognition from BERN2? I think the code I'm asking about is:

BERN2/bern2/normalizer.py

Lines 307 to 401 in 20cef6b

    
    # call GNormPlus
    elif ent_type == 'gene':
        # create socket
        # s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s = socket.socket()
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        try:
            s.connect((self.HOST, self.GENE_PORT))
        except ConnectionRefusedError as cre:
            print('Check GNormPlus jar', cre)
            s.close()
            return oids
        # 1. Write as input files to normalizers
        norm_inp_path = os.path.join(self.NORM_INPUT_DIR[ent_type],
                                     input_filename)
        norm_abs_path = os.path.join(self.NORM_INPUT_DIR[ent_type],
                                     base_thread_name + '.txt')
        space_type = ' ' + ent_type
        with open(norm_inp_path, 'w') as norm_inp_f:
            with open(norm_abs_path, 'w') as norm_abs_f:
                for saved_item in saved_items:
                    entities = saved_item['entities'][ent_type]
                    if len(entities) == 0:
                        continue
                    abstract_title = saved_item['abstract']
                    ent_names = list()
                    for loc in entities:
                        e_name = abstract_title[loc['start']:loc['end']]
                        if len(e_name) > len(space_type) \
                                and space_type \
                                in e_name.lower()[-len(space_type):]:
                            # print('Replace', e_name,
                            #       'w/', e_name[:-len(space_type)])
                            e_name = e_name[:-len(space_type)]
                        ent_names.append(e_name)
                    norm_abs_f.write(saved_item['pmid'] + '||' +
                                     abstract_title + '\n')
                    norm_inp_f.write('||'.join(ent_names) + '\n')
        # 2. Run normalizers
        gene_input_dir = os.path.abspath(
            os.path.join(self.NORM_INPUT_DIR[ent_type]))
        gene_output_dir = os.path.abspath(
            os.path.join(self.NORM_OUTPUT_DIR[ent_type]))
        setup_dir = self.NORM_DICT_PATH[ent_type] # setup.txt
        # start jar
        jar_args = '\t'.join(
            [gene_input_dir, gene_output_dir, setup_dir, '9606',  # human
             base_thread_name]) + '\n'
        s.send(jar_args.encode('utf-8'))
        # input_stream = struct.pack('>H', len(jar_args)) + jar_args.encode('utf-8')
        # s.send(input_stream)
        s.recv(bufsize)
        s.close()
        # 3. Read output files of normalizers
        norm_out_path = os.path.join(gene_output_dir, output_filename)
        if os.path.exists(norm_out_path):
            with open(norm_out_path, 'r') as norm_out_f, \
                    open(norm_inp_path, 'r') as norm_in_f:
                for line, input_l in zip(norm_out_f, norm_in_f):
                    gene_ids, gene_mentions = line[:-1].split('||'), \
                                              input_l[:-1].split('||')
                    for gene_id, gene_mention in zip(gene_ids,
                                                     gene_mentions):
                        eid = None
                        if gene_id.lower() == 'cui-less':
                            eid = self.NO_ENTITY_ID
                        else:
                            bar_idx = gene_id.find('-')
                            if bar_idx > -1:
                                gene_id = gene_id[:bar_idx]
                            eid = gene_id
                            eid = "EntrezGene:" + eid
                        oids.append(eid)
            # 5. Remove output files
            os.remove(norm_out_path)
        else:
            print('Not found!!!', norm_out_path)
            # Sad error handling
            for _ in range(len(name_ptr)):
                oids.append(self.NO_ENTITY_ID)
        # 4. Remove input files
        os.remove(norm_inp_path)
        os.remove(norm_abs_path)
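
To make the string handling in that excerpt easier to follow, here is a standalone sketch of just two of its steps: trimming a trailing " gene" from a mention before it is written to the input file, and turning one `||`-separated output line into prefixed IDs. The function names are made up, the socket and file I/O are left out, the value of `NO_ENTITY_ID` is an assumption standing in for `self.NO_ENTITY_ID`, and the numeric IDs in the example are placeholders:

```python
from typing import List

# Assumption: stands in for self.NO_ENTITY_ID, whose value is not shown in the excerpt.
NO_ENTITY_ID = "CUI-less"


def strip_type_suffix(name: str, ent_type: str = "gene") -> str:
    """Drop a trailing ' gene' (case-insensitive) so only the name is normalized."""
    suffix = " " + ent_type
    if len(name) > len(suffix) and name.lower().endswith(suffix):
        return name[:-len(suffix)]
    return name


def parse_output_line(line: str) -> List[str]:
    """Convert one '||'-separated output line into a list of entity IDs."""
    ids = []
    for gene_id in line.rstrip("\n").split("||"):
        if gene_id.lower() == "cui-less":
            ids.append(NO_ENTITY_ID)
        else:
            # Keep only the part before the first '-', then add the prefix.
            ids.append("EntrezGene:" + gene_id.split("-", 1)[0])
    return ids


print(strip_type_suffix("BRCA1 gene"))
print(parse_output_line("111-222||CUI-less||333\n"))
```

Unlike the excerpt, this sketch does not pair each ID with its input mention via `zip`, so it is a simplification rather than an exact reproduction.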

cthoyt · 2022-01-19T00:45:47Z

Here's our off-the-shelf gene (and other entity) normalizer that's ready for use: https://github.com/indralab/gilda
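
For anyone who wants to try this on a list of mentions, a minimal sketch follows. It assumes a grounding function that takes a string and returns a list of matches ordered best-first, each with a `term` carrying `db` and `id` attributes, which is how I understand `gilda.ground` to behave. The helper `ground_mentions` and the stub grounder are hypothetical, and the stub's identifier is a placeholder, not a real result:

```python
from types import SimpleNamespace
from typing import Callable, Dict, Iterable, Optional


def ground_mentions(
    mentions: Iterable[str], ground: Callable
) -> Dict[str, Optional[str]]:
    """Map each mention to 'DB:ID' from the top match, or None if nothing matched."""
    results: Dict[str, Optional[str]] = {}
    for mention in mentions:
        matches = ground(mention)
        if matches:
            top = matches[0].term
            results[mention] = f"{top.db}:{top.id}"
        else:
            results[mention] = None
    return results


def stub_ground(text: str) -> list:
    """Hypothetical stand-in for a real grounder; returns a placeholder ID."""
    if text == "known protein":
        term = SimpleNamespace(db="HGNC", id="0000")
        return [SimpleNamespace(term=term, score=1.0)]
    return []


demo = ground_mentions(["known protein", "unknown protein"], stub_ground)
print(demo)

# With Gilda installed (pip install gilda), the real grounder can be passed in:
#   import gilda
#   ground_mentions(["Sorting nexin 9"], gilda.ground)
```

Passing the grounder in as an argument keeps the helper testable without the Gilda resources and makes it easy to compare against other normalizers.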

mjeensung · 2022-01-19T03:04:28Z

@dhimmel, that's correct.

For genes, mentions are recognized by the BERN2 NER model (better performance than GNormPlus) and normalized by GNormPlus.

mjeensung · 2022-01-19T03:11:36Z

Thank you for recommending this great tool, @cthoyt.
We will look into the tool, Gilda, and see if we can incorporate it into BERN2.

mjeensung closed this as completed Jan 25, 2022
