## HetioNet for BioBLP


- Load HetioNet
- Extract lists of entities of interest: Genes (BioKG: Proteins), Compounds (BioKG: Drugs), Diseases
- Extract triples
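
The triple extraction step listed above is not shown in this notebook, so here is a minimal sketch of one way it could look. It assumes each edge in the HetioNet JSON stores `source_id` and `target_id` as `[kind, identifier]` pairs alongside a `kind` field. The toy data, the `DB00001` identifier, and the `extract_triples` helper are hypothetical and only there for illustration.

```python
# Toy stand-in for the loaded HetioNet JSON. The edge layout is an assumption.
toy_hetnet = {
    "edges": [
        {"source_id": ["Compound", "DB00001"], "target_id": ["Gene", 5345], "kind": "binds"},
        {"source_id": ["Gene", 5345], "target_id": ["Gene", 9409], "kind": "interacts"},
        {"source_id": ["Anatomy", "UBERON:0000001"], "target_id": ["Gene", 9409], "kind": "expresses"},
    ]
}

# Node kinds we want to keep, matching the entity lists named above.
kinds_of_interest = {"Gene", "Compound", "Disease"}


def extract_triples(hetnet, kinds):
    """Return (head, relation, tail) tuples whose endpoints are both of a wanted kind."""
    triples = []
    for edge in hetnet["edges"]:
        src_kind, src_id = edge["source_id"]
        dst_kind, dst_id = edge["target_id"]
        if src_kind in kinds and dst_kind in kinds:
            triples.append((src_id, edge["kind"], dst_id))
    return triples


triples = extract_triples(toy_hetnet, kinds_of_interest)
```

Filtering on both endpoints drops any edge that touches a node type outside the entity lists, such as the Anatomy edge in the toy data.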

### From gene to protein

- Gene ID from HGNC ---> UniProt ID of canonical isoform ---> AA sequence
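
As a rough illustration of the chain above, the two lookups can be expressed as a single join. The column names are hypothetical, the two gene-to-UniProt pairs are borrowed from the HGNC table shown later in this notebook, and the sequences are made-up placeholders rather than real proteins.

```python
import pandas as pd

# Toy HGNC-style mapping: NCBI Gene ID -> UniProt accession.
gene_to_uniprot = pd.DataFrame({
    "ncbi_gene_id": [41, 5999],
    "uniprot_id": ["P78348", "P49798"],
})

# Toy UniProt-style table: accession -> AA sequence (placeholder strings).
uniprot_to_seq = pd.DataFrame({
    "uniprot_id": ["P78348", "P49798"],
    "sequence": ["MKTAYIAK", "MSEQNNTE"],
})

# An inner join keeps only genes for which both steps of the chain succeed.
gene_to_seq = gene_to_uniprot.merge(uniprot_to_seq, on="uniprot_id", how="inner")
seq_by_gene = dict(zip(gene_to_seq["ncbi_gene_id"], gene_to_seq["sequence"]))
```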




### From AA sequence to embedding (moved to separate notebook)

- Load ProtTrans
- Parse AA sequences
- Store embeddings in the form <uniprot_id: str, embedding: Tensor>



In [1]:
import pyobo
import pandas as pd
import time

import json
import hetnetpy
from pathlib import Path

### Load HetNet

In [2]:
from hetnetpy.readwrite import open_read_file, load

In [3]:
hetio_data_path = Path('/Users/dalivas/Documents/Projects/hetionet/hetnet')

# Use a relative segment: joinpath('/json/') would discard the base path
hetio_json_path = hetio_data_path.joinpath('json')



read_file = open_read_file('../data/hetionet/hetionet-v1.0.json')

hetio_data = load(read_file, formatting='json')

In [4]:
hetio_data['nodes']

[{'kind': 'Molecular Function',
  'identifier': 'GO:0031753',
  'name': 'endothelial differentiation G-protein coupled receptor binding',
  'data': {'source': 'Gene Ontology',
   'license': 'CC BY 4.0',
   'url': 'http://purl.obolibrary.org/obo/GO_0031753'}},
 {'kind': 'Side Effect',
  'identifier': 'C0023448',
  'name': 'Lymphocytic leukaemia',
  'data': {'source': 'UMLS via SIDER 4.1',
   'license': 'CC BY-NC-SA 4.0',
   'url': 'http://identifiers.org/umls/C0023448'}},
 {'kind': 'Gene',
  'identifier': 5345,
  'name': 'SERPINF2',
  'data': {'description': 'serpin peptidase inhibitor, clade F (alpha-2 antiplasmin, pigment epithelium derived factor), member 2',
   'source': 'Entrez Gene',
   'license': 'CC0 1.0',
   'url': 'http://identifiers.org/ncbigene/5345',
   'chromosome': '17'}},
 {'kind': 'Gene',
  'identifier': 9409,
  'name': 'PEX16',
  'data': {'description': 'peroxisomal biogenesis factor 16',
   'source': 'Entrez Gene',
   'license': 'CC0 1.0',
   'url': 'http://identifier

In [5]:
gene_dict = {}

disease_dict = {}

compound_dict = {}

for node in hetio_data['nodes']:
    if node['kind'] == 'Gene':
        gene_dict[node['identifier']] = node['data']['url']
        
        
    if node['kind'] == 'Disease':
        disease_dict[node['identifier']] = node['data']['url']
        
    if node['kind'] == 'Compound':
        compound_dict[node['identifier']] = node['data']['url']

In [6]:
len(gene_dict), len(disease_dict), len(compound_dict)

(20945, 137, 1552)

In [None]:
hetio_data['']

### HGNC data

The HUGO Gene Nomenclature Committee (HGNC) is a widely used resource for approved human gene nomenclature, containing ~42000 gene symbols and names and 1300+ gene families and sets.

We use it to map each gene to its canonical protein isoform, and from there to the AA sequence.



In [7]:
hgnc_mapping = pd.read_csv('../data/hetionet/hgnc_id_mapping.tsv', sep='\t')

In [8]:
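# Keep only the two ID columns; .copy() avoids chained-assignment warnings below
hgnc_ncbi_uniprot = hgnc_mapping[['NCBI Gene ID', 'UniProt ID(supplied by UniProt)']].copy()

# Drop rows where either ID is missing
hgnc_ncbi_uniprot = hgnc_ncbi_uniprot.dropna(axis=0, how='any')

# Columns with missing values are typically parsed as floats; cast the Gene IDs back to int
hgnc_ncbi_uniprot['NCBI Gene ID'] = hgnc_ncbi_uniprot['NCBI Gene ID'].astype(int)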

hgnc_ncbi_uniprot.head(5)

Unnamed: 0,NCBI Gene ID,UniProt ID(supplied by UniProt)
2,41,P78348
4,5999,P49798
5,8490,O15539
6,9628,P49758
7,6000,P49802


In [9]:
hgnc_ncbi_uniprot

Unnamed: 0,NCBI Gene ID,UniProt ID(supplied by UniProt)
2,41,P78348
4,5999,P49798
5,8490,O15539
6,9628,P49758
7,6000,P49802
...,...,...
48658,6003,O14921
48659,10636,O43566
48660,6004,O15492
48661,5997,P41220


#### Find the GeneIDs that are included in HetioNet


In [10]:
# Replace each NCBI Gene ID with its HetioNet URL; IDs absent from HetioNet become None
hgnc_ncbi_uniprot['NCBI Gene ID'] = hgnc_ncbi_uniprot['NCBI Gene ID'].map(gene_dict.get)




### Small issue with multiple mappings between a Gene and a Protein

In [11]:
hgnc_ncbi_uniprot[hgnc_ncbi_uniprot['UniProt ID(supplied by UniProt)'].str.contains(', ', regex=False)]

Unnamed: 0,NCBI Gene ID,UniProt ID(supplied by UniProt)
932,http://identifiers.org/ncbigene/7979,"P60896, Q6ZVN7"
1744,http://identifiers.org/ncbigene/706,"B1AH88, P30536"
2071,http://identifiers.org/ncbigene/7112,"P42166, P42167"
2205,http://identifiers.org/ncbigene/27433,"Q5JU69, Q8N2E6"
2241,http://identifiers.org/ncbigene/6955,"P0DSE1, P0DTU3"
2383,http://identifiers.org/ncbigene/6957,"P0DSE2, P0DTU4"
3932,http://identifiers.org/ncbigene/23499,"O94854, Q9UPN3"
4011,,"P61583, Q9HDB8, Q9HDB9"
4139,,"Q69383, Q69384, Q7LDI9, Q9BXR3, Q9Y6I0"
4589,http://identifiers.org/ncbigene/796,"P01258, P06881"


### We need to decide how to deal with this...
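
One possible option, sketched here on toy data and not a decision taken in this notebook, is to split the comma-separated UniProt IDs so that each gene-protein pair gets its own row. The column names are hypothetical. The first row mirrors an entry from the table above; pairing gene 41 with that URL form is an assumption made for illustration.

```python
import pandas as pd

# Toy frame with one multi-ID row and one single-ID row.
mapping = pd.DataFrame({
    "gene": [
        "http://identifiers.org/ncbigene/7979",
        "http://identifiers.org/ncbigene/41",
    ],
    "uniprot": ["P60896, Q6ZVN7", "P78348"],
})

# Split on the ", " separator, then explode so each ID gets its own row.
exploded = (
    mapping.assign(uniprot=mapping["uniprot"].str.split(", "))
    .explode("uniprot")
    .reset_index(drop=True)
)
```

This turns one gene into several candidate proteins, so a later step would still need to choose among them (for example, by preferring reviewed entries) if a single canonical sequence per gene is required.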

In [12]:
# After replacing NCBI Gene IDs with their HetioNet URIs, drop the rows that have no URI (genes not in HetioNet)
hgnc_ncbi_uniprot = hgnc_ncbi_uniprot.dropna(axis=0, how='any')


# Confirm by checking for nan_rows
nan_rows = hgnc_ncbi_uniprot[hgnc_ncbi_uniprot['NCBI Gene ID'].isnull()]

In [13]:
print(f'number of nan rows: {len(nan_rows)}')

number of nan rows: 0


In [14]:
import torch

protein_emb = torch.load('../data/processed/prot_id_to_embedding.pt')

In [15]:
biokg_protein_id_list = protein_emb.index.to_list()

In [16]:
id_set_0 = set(biokg_protein_id_list)

In [17]:
id_set_1 = set(hgnc_ncbi_uniprot['UniProt ID(supplied by UniProt)'])

In [18]:
len(id_set_0), len(id_set_1)

(82309, 18944)

In [19]:
difference = id_set_1.difference(id_set_0)

In [20]:
len(difference)

18056
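
Part of this difference may come from the comma-joined entries noted earlier (for example `"P60896, Q6ZVN7"`), since a joined string cannot match a single accession. The sketch below shows one way to check coverage after splitting such entries. The `coverage` helper and the toy inputs are hypothetical, and whether splitting actually changes the count here is untested.

```python
def coverage(reference_ids, query_ids):
    """Split comma-joined query IDs, then compare them against the reference set."""
    reference = set(reference_ids)
    query = set()
    for entry in query_ids:
        # "A, B" contributes two separate accessions.
        query.update(part.strip() for part in entry.split(","))
    missing = query - reference
    return len(query & reference), len(missing), sorted(missing)


found, n_missing, missing = coverage(
    ["P78348", "Q6ZVN7"],            # toy stand-in for the BioKG protein IDs
    ["P78348", "P60896, Q6ZVN7"],    # toy stand-in for the HGNC UniProt column
)
```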

In [21]:
hetio_uniprot_ids = pd.DataFrame(list(id_set_1))

In [22]:
hetio_uniprot_ids.to_csv('../data/hetionet/hetio_uniprotids.csv', sep=',', header=None, index=None)

In [23]:
len(id_set_1)

18944

In [24]:
pd.options.display.max_rows = 4000

In [83]:
hetio_uniprot_kb = pd.read_csv('../data/hetionet/hetio_unique_proteins.tsv', sep='\t')

In [84]:
hetio_uniprot_kb

Unnamed: 0,From,Entry,Reviewed,Entry Name,Protein names,Gene Names,Organism,Length,Sequence,Gene Ontology IDs
0,Q8NCL4,Q8NCL4,reviewed,GALT6_HUMAN,Polypeptide N-acetylgalactosaminyltransferase ...,GALNT6,Homo sapiens (Human),622,MRLLRRRHMPLRLAMVGCAFVLFLFLLHRDVSSREEATEKPWLKSL...,GO:0000139; GO:0004653; GO:0005794; GO:0006493...
1,Q68DI1,Q68DI1,reviewed,ZN776_HUMAN,Zinc finger protein 776,ZNF776,Homo sapiens (Human),518,MAAAALRPPAQGTVTFEDVAVNFSQEEWSLLSEAQRCLYHDVMLEN...,GO:0000978; GO:0000981; GO:0005634; GO:0006357...
2,Q86W28,Q86W28,reviewed,NALP8_HUMAN,"NACHT, LRR and PYD domains-containing protein ...",NLRP8 NALP8 NOD16 PAN4,Homo sapiens (Human),1048,MSDVNPPSDTPIPFSSSSTHSSHIPPWTFSCYPGSPCENGVMLYMR...,GO:0005524; GO:0005737; GO:0050727
3,Q6NSX1,Q6NSX1,reviewed,CCD70_HUMAN,Coiled-coil domain-containing protein 70,CCDC70,Homo sapiens (Human),233,MATPPFRLIRKMFSFKVSRWMGLACFRSLAASSPSIRQKKLMHKLQ...,GO:0005739; GO:0005886
4,Q9Y228,Q9Y228,reviewed,T3JAM_HUMAN,TRAF3-interacting JNK-activating modulator (TR...,TRAF3IP3 T3JAM,Homo sapiens (Human),551,MISPDPRPSPGLARWAESYEAKCERRQEIRESRRCRPNVTTCRQVG...,GO:0000139; GO:0005741; GO:0005765; GO:0005886...
...,...,...,...,...,...,...,...,...,...,...
20031,Q99973,Q99973,reviewed,TEP1_HUMAN,Telomerase protein component 1 (Telomerase-ass...,TEP1 TLP1 TP1,Homo sapiens (Human),2627,MEKLHGHVSAHPDILSLENRCLAMLPDLQPLEKLHQHVSTHSDILS...,GO:0000722; GO:0000781; GO:0002039; GO:0003720...
20032,P60981,P60981,reviewed,DEST_HUMAN,Destrin (Actin-depolymerizing factor) (ADF),DSTN ACTDP DSN,Homo sapiens (Human),165,MASGVQVADEVCRIFYDMKVRKCSTPEEIKKRKKAVIFCLSADKKC...,GO:0005737; GO:0008154; GO:0015629; GO:0030042...
20033,Q8NFP0,Q8NFP0,reviewed,PXT1_HUMAN,Peroxisomal testis-specific protein 1 (Small t...,PXT1 STEPP,Homo sapiens (Human),134,MKKKHDGIVYETKEVLNPSPKVTHCCKSLWLKYSFQKAYMTQLVSS...,GO:0005634; GO:0005777; GO:0043065
20034,Q9H3N1,Q9H3N1,reviewed,TMX1_HUMAN,Thioredoxin-related transmembrane protein 1 (T...,TMX1 TMX TXNDC TXNDC1 PSEC0085 UNQ235/PRO268,Homo sapiens (Human),280,MAPSGSLAVPLAVLVLLLWGAPWTHGRRSNVRVITDENWRELLEGD...,GO:0005789; GO:0012505; GO:0015036; GO:0016021...


In [85]:
protein_sequences = hetio_uniprot_kb[['From', 'Sequence']]

In [82]:
protein_sequences = protein_sequences.set_index('From')

### Got the corresponding canonical sequence. Now we embed it.

## Diseases 

- HetioNet contains DOIDs
- Load DOID on GraphDB
    - Get DOID to MESH mappings
- Save DOID - MESH to be used to get embeddings later on
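
The last two steps can be sketched on toy data as follows. The rows are copied from the mapping table displayed further down, while the set of HetioNet disease IDs and the variable names are hypothetical stand-ins.

```python
import pandas as pd

# Toy excerpt of a DOID -> MeSH mapping table.
doid_mesh_toy = pd.DataFrame({
    "doid": ["DOID:4", "DOID:175", "DOID:1612"],
    "xrefs": ["MESH:D004194", "MESH:D019043", "MESH:D001943"],
})

# Stand-in for the disease identifiers found in HetioNet.
hetionet_diseases = {"DOID:175", "DOID:1612"}

# Keep only the diseases present in HetioNet and build a lookup dict.
in_hetionet = doid_mesh_toy[doid_mesh_toy["doid"].isin(hetionet_diseases)]
doid_to_mesh = dict(zip(in_hetionet["doid"], in_hetionet["xrefs"]))
```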

In [121]:
doid_id_to_name = pyobo.get_id_name_mapping('doid', strict=False)



In [None]:
pyobo.get_primary_identifier('doid', '5583')

In [122]:
doid_mesh = pd.read_csv('../data/hetionet/doid_mesh_mappings.csv')

In [123]:
doid_mesh

Unnamed: 0,s,doid,label,xrefs
0,http://purl.obolibrary.org/obo/DOID_4,DOID:4,disease,MESH:D004194
1,http://purl.obolibrary.org/obo/DOID_0001816,DOID:0001816,angiosarcoma,MESH:D006394
2,http://purl.obolibrary.org/obo/DOID_175,DOID:175,vascular cancer,MESH:D019043
3,http://purl.obolibrary.org/obo/DOID_10124,DOID:10124,corneal disease,MESH:D003316
4,http://purl.obolibrary.org/obo/DOID_0014667,DOID:0014667,disease of metabolism,MESH:D008659
5,http://purl.obolibrary.org/obo/DOID_3042,DOID:3042,allergic contact dermatitis,MESH:D017449
6,http://purl.obolibrary.org/obo/DOID_3818,DOID:3818,photoallergic dermatitis,MESH:D017454
7,http://purl.obolibrary.org/obo/DOID_874,DOID:874,bacterial pneumonia,MESH:D018410
8,http://purl.obolibrary.org/obo/DOID_104,DOID:104,bacterial infectious disease,MESH:D001424
9,http://purl.obolibrary.org/obo/DOID_11266,DOID:11266,Hantavirus hemorrhagic fever with renal syndrome,MESH:D006480


In [124]:
# Flag the DOIDs that are present among the HetioNet diseases
doid_mesh['exists'] = doid_mesh['doid'].isin(list(disease_dict))

In [125]:
len(disease_dict)

137

In [126]:
doid_mesh[doid_mesh['exists']]

Unnamed: 0,s,doid,label,xrefs,exists
2,http://purl.obolibrary.org/obo/DOID_175,DOID:175,vascular cancer,MESH:D019043,True
42,http://purl.obolibrary.org/obo/DOID_0050156,DOID:0050156,idiopathic pulmonary fibrosis,MESH:D054990,True
83,http://purl.obolibrary.org/obo/DOID_1459,DOID:1459,hypothyroidism,MESH:D007037,True
92,http://purl.obolibrary.org/obo/DOID_0050425,DOID:0050425,restless legs syndrome,MESH:D012148,True
188,http://purl.obolibrary.org/obo/DOID_1686,DOID:1686,glaucoma,MESH:D005901,True
227,http://purl.obolibrary.org/obo/DOID_1612,DOID:1612,breast cancer,MESH:D001943,True
245,http://purl.obolibrary.org/obo/DOID_1826,DOID:1826,epilepsy,MESH:D004827,True
253,http://purl.obolibrary.org/obo/DOID_0050742,DOID:0050742,nicotine dependence,MESH:D014029,True
260,http://purl.obolibrary.org/obo/DOID_332,DOID:332,amyotrophic lateral sclerosis,MESH:D000690,True
276,http://purl.obolibrary.org/obo/DOID_2377,DOID:2377,multiple sclerosis,MESH:D009103,True
