# Database module

The database module queries external database APIs to extract clinically relevant mutational data and protein annotations from the following sources:
+ **KinHub:** curated list of human kinases
+ **UniProt:** obtain canonical protein sequence information
+ **Pfam:** annotate protein domains
+ **HGNC:** standardize gene naming conventions
+ **KLIFS:** kinase-ligand interaction annotations
+ **cBioPortal:** multi-institutional repository of sequencing data for cancer genomics

## KinHub

In this vignette, we retrieve the list of human kinases from [KinHub](http://www.kinhub.org/kinases.html), then obtain additional protein annotations for these kinases and extract the corresponding mutations from a cBioPortal cohort.

In [1]:
from missense_kinase_toolkit.databases import scrapers

In [2]:
df_kinhub = scrapers.kinhub()

In [KinHub](http://www.kinhub.org/kinases.html), 13 kinases possess more than one kinase domain and are therefore listed as separate entries despite sharing a single UniProt ID. To remedy this, we aggregate entries by HGNC gene name so that each entry represents a unique protein rather than a unique kinase domain.

In [3]:
df_kinhub.loc[df_kinhub["Manning Name"].apply(lambda x: "DOMAIN2" in x.upper()), ]

| | HGNC Name | xName | Manning Name | Kinase Name | Group | Family | SubFamily | UniprotID |
|---|---|---|---|---|---|---|---|---|
| 141 | JAK1 | JAK1, JAK1_b | JAK1, Domain2_JAK1 | Tyrosine-protein kinase JAK1 | TK | JakB, Jak | | P23458 |
| 142 | JAK2 | JAK2, JAK2_b | JAK2, Domain2_JAK2 | Tyrosine-protein kinase JAK2 | TK | JakB, Jak | | O60674 |
| 143 | JAK3 | JAK3_b, JAK3 | Domain2_JAK3, JAK3 | Tyrosine-protein kinase JAK3 | TK | JakB, Jak | | P52333 |
| 191 | RPS6KA5 | MSK1_b, MSK1 | Domain2_MSK1, MSK1 | Ribosomal protein S6 kinase alpha-5 | AGC, CAMK | RSKb, RSK | MSKb, MSK | O75582 |
| 192 | RPS6KA4 | MSK2, MSK2_b | MSK2, Domain2_MSK2 | Ribosomal protein S6 kinase alpha-4 | AGC, CAMK | RSKb, RSK | MSKb, MSK | O75676 |
| 268 | RPS6KA2 | RSK3_b, RSK3 | Domain2_RSK1, RSK1 | Ribosomal protein S6 kinase alpha-2 | AGC, CAMK | RSKb, RSK | RSKp90, RSKb | Q15349 |
| 269 | RPS6KA3 | RSK2_b, RSK2 | Domain2_RSK2, RSK2 | Ribosomal protein S6 kinase alpha-3 | AGC, CAMK | RSKb, RSK | RSKp90, RSKb | P51812 |
| 270 | RPS6KA1 | RSK1_b, RSK1 | Domain2_RSK3, RSK3 | Ribosomal protein S6 kinase alpha-1 | AGC, CAMK | RSKb, RSK | RSKp90, RSKb | Q15418 |
| 305 | TYK2 | TYK2_b, TYK2 | Domain2_TYK2, TYK2 | Non-receptor tyrosine-protein kinase TYK2 | TK | JakB, Jak | | P29597 |
| 395 | EIF2AK4 | GCN2, GCN2_b | GCN2, Domain2_GCN2 | Eukaryotic translation initiation factor 2-alp... | STE, Other | PEK, STE-Unique | GCN2, nan | Q9P2K8 |
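
The aggregation described above can be sketched with pandas. This is a minimal illustration on a toy three-row frame, not the toolkit's actual implementation; the helper name `join_unique` and the toy values are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for a raw kinase table with one row per kinase domain.
df_raw = pd.DataFrame(
    {
        "HGNC Name": ["JAK1", "JAK1", "ABL1"],
        "Manning Name": ["JAK1", "Domain2_JAK1", "ABL"],
        "Group": ["TK", "TK", "TK"],
        "UniprotID": ["P23458", "P23458", "P00519"],
    }
)


def join_unique(values: pd.Series) -> str:
    """Join the distinct values of a column in order of first appearance."""
    return ", ".join(dict.fromkeys(values.astype(str)))


# Collapse domain-level rows into one row per gene name.
df_agg = df_raw.groupby("HGNC Name", as_index=False, sort=False).agg(join_unique)
print(df_agg)
```

Values shared by both domains (such as the UniProt ID) collapse to a single value, while differing values are kept as a comma-separated string, which matches the shape of the table above.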


## UniProt

In [4]:
from missense_kinase_toolkit.databases import uniprot

In [5]:
uniprot.UniProt(df_kinhub["UniprotID"][0])._sequence

'MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAEHRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLFSALIKKKKKTAPTPPKRSSSFREMDGQPERRGAGEEEGRDISNGALAFTPLDTADPAKSPKPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSASCVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTVTPPPRLVKKNEEAADEVFKDIMESSPGSSPPNLTPKPLRRQVTVAPASGLPHKEEAGKGSALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPPPPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVLPATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPSS
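
For readers curious what such a lookup can involve, the sketch below parses a FASTA record of the kind served by UniProt's public REST service. It is an illustration only: both helper names are hypothetical, the URL pattern is an assumption about the public endpoint, and the toolkit's own implementation may work differently.

```python
from urllib.request import urlopen


def parse_fasta_sequence(fasta_text: str) -> str:
    """Return the sequence of the first record in FASTA-formatted text."""
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("Input is not FASTA-formatted")
    residues = []
    for line in lines[1:]:
        if line.startswith(">"):  # stop at the next record
            break
        residues.append(line)
    return "".join(residues)


def fetch_canonical_sequence(uniprot_id: str) -> str:
    """Download and parse one FASTA record (assumed URL pattern; needs network)."""
    url = f"https://rest.uniprot.org/uniprotkb/{uniprot_id}.fasta"
    with urlopen(url, timeout=30) as response:
        return parse_fasta_sequence(response.read().decode("utf-8"))


# Offline demonstration on a made-up record; no request is sent here.
example = ">sp|X00000|TOY_HUMAN Toy protein\nMLEICLKLVG\nCKSKKGLSSS\n"
print(parse_fasta_sequence(example))
```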

## Pfam

In [6]:
from missense_kinase_toolkit.databases import pfam

In [7]:
pfam.Pfam(df_kinhub["UniprotID"][0])._pfam

| | uniprot | protein_length | source_database | organism | in_alphafold | pfam_accession | name | source_database.1 | type | integrated | member_databases | go_terms | start | end | dc-status | representative | representative.1 | model | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p00519 | 1130 | reviewed | 9606 | True | PF00017 | SH2 domain | pfam | domain | IPR000980 | | | 127 | 202 | CONTINUOUS | False | False | PF00017 | 9.3e-21 |
| 1 | p00519 | 1130 | reviewed | 9606 | True | PF00018 | SH3 domain | pfam | domain | IPR001452 | | | 67 | 113 | CONTINUOUS | False | False | PF00018 | 2e-09 |
| 2 | p00519 | 1130 | reviewed | 9606 | True | PF07714 | Protein tyrosine and serine/threonine kinase | pfam | domain | IPR001245 | | | 242 | 492 | CONTINUOUS | False | False | PF07714 | 1.4e-97 |
| 3 | p00519 | 1130 | reviewed | 9606 | True | PF08919 | F-actin binding | pfam | domain | IPR015015 | | | 1026 | 1130 | CONTINUOUS | False | False | PF08919 | 3.5000000000000004e-28 |
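
One common use of these annotations is to ask which domain, if any, contains a given residue. The sketch below does this with the `start` and `end` values copied from the output above. It assumes the boundaries are 1-based and inclusive; `find_domain` is a hypothetical helper, not part of the toolkit.

```python
import pandas as pd

# Domain boundaries copied from the Pfam output for P00519 shown above.
df_domains = pd.DataFrame(
    {
        "name": [
            "SH2 domain",
            "SH3 domain",
            "Protein tyrosine and serine/threonine kinase",
            "F-actin binding",
        ],
        "start": [127, 67, 242, 1026],
        "end": [202, 113, 492, 1130],
    }
)


def find_domain(df: pd.DataFrame, position: int):
    """Return the name of the first domain containing a position, else None."""
    hits = df.loc[(df["start"] <= position) & (position <= df["end"]), "name"]
    return hits.iloc[0] if not hits.empty else None


print(find_domain(df_domains, 315))  # inside the kinase domain (242-492)
print(find_domain(df_domains, 947))  # falls between annotated domains
```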


## HGNC

In this example, KinHub provides UniProt IDs for each entry. UniProt IDs are needed to query the UniProt API. However, if you need to retrieve UniProt IDs from HGNC gene names, Ensembl IDs, or other identifiers, the HGNC module can be used to interrogate HGNC's [Genename API](https://www.genenames.org/help/rest/).

In [8]:
from missense_kinase_toolkit.databases import hgnc

In [9]:
hgnc.HGNC("Abl1").maybe_get_info_from_hgnc_fetch(["uniprot_ids"])["uniprot_ids"][0][0]

'P00519'
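
The double `[0][0]` index suggests that the result holds one entry per matching HGNC record, each of which is itself a list of UniProt IDs. The sketch below illustrates that shape with a hand-written dictionary mimicking the JSON layout we assume the HGNC REST service returns (`response`, then `docs`, then `uniprot_ids`). Both the layout and the helper name `extract_field` are assumptions, not the toolkit's code.

```python
def extract_field(payload: dict, field: str) -> list:
    """Collect one field from every record in an HGNC-style response."""
    docs = payload.get("response", {}).get("docs", [])
    return [doc[field] for doc in docs if field in doc]


# Hand-written stand-in for a response to a fetch by symbol (assumed layout).
mock_payload = {
    "response": {
        "numFound": 1,
        "docs": [{"symbol": "ABL1", "uniprot_ids": ["P00519"]}],
    }
}

uniprot_ids = extract_field(mock_payload, "uniprot_ids")
print(uniprot_ids)        # one inner list per record
print(uniprot_ids[0][0])  # first ID of the first record
```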

## KLIFS

The KLIFS database's `kinase_information` endpoint takes a `kinase_name` input, which can be either an HGNC gene name or a UniProt ID.

In [10]:
from missense_kinase_toolkit.databases import klifs

In [11]:
klifs.KinaseInfo("Abl1")._kinase_info

{'family': 'Abl',
 'full_name': 'ABL proto-oncogene 1, non-receptor tyrosine kinase',
 'gene_name': 'ABL1',
 'group': 'TK',
 'iuphar': 1923,
 'kinase_ID': 392,
 'name': 'ABL1',
 'pocket': 'HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITEFMTYGNLLDYLREYLEKKNFIHRDLAARNCLVVADFGLS',
 'species': 'Human',
 'subfamily': '',
 'uniprot': 'P00519'}

In [12]:
klifs.KinaseInfo(df_kinhub["UniprotID"][0])._kinase_info

{'family': 'Abl',
 'full_name': 'ABL proto-oncogene 1, non-receptor tyrosine kinase',
 'gene_name': 'ABL1',
 'group': 'TK',
 'iuphar': 1923,
 'kinase_ID': 392,
 'name': 'ABL1',
 'pocket': 'HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITEFMTYGNLLDYLREYLEKKNFIHRDLAARNCLVVADFGLS',
 'species': 'Human',
 'subfamily': '',
 'uniprot': 'P00519'}
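
Both queries return the same record. A quick sanity check, sketched below on values copied from the output above, compares the two results and confirms that the pocket string has the 85 positions KLIFS uses for its binding-pocket alignment. The function names are illustrative, and counting `-` as the gap character is an assumption.

```python
KLIFS_POCKET_LENGTH = 85  # number of aligned pocket positions defined by KLIFS

pocket = "HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITEFMTYGNLLDYLREYLEKKNFIHRDLAARNCLVVADFGLS"

# Stand-ins for the results of the gene-name and UniProt ID queries.
info_by_name = {"gene_name": "ABL1", "kinase_ID": 392, "uniprot": "P00519", "pocket": pocket}
info_by_uniprot = {"gene_name": "ABL1", "kinase_ID": 392, "uniprot": "P00519", "pocket": pocket}


def same_kinase(first: dict, second: dict) -> bool:
    """Two records describe the same kinase if their KLIFS IDs agree."""
    return first["kinase_ID"] == second["kinase_ID"]


def summarize_pocket(pocket_sequence: str) -> dict:
    """Report the pocket length and the number of gap characters."""
    return {"length": len(pocket_sequence), "gaps": pocket_sequence.count("-")}


print(same_kinase(info_by_name, info_by_uniprot))
print(summarize_pocket(pocket))
```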

## cBioPortal

In [13]:
import os
import pandas as pd
from missense_kinase_toolkit.databases import cbioportal

This module reads its settings from environment variables. See the `config` module documentation for additional details. In this example, we query the publicly available cBioPortal instance and use the [Zehir, 2017](https://www.nature.com/articles/nm.4333) MSK-IMPACT sequencing cohort as the study of interest.

In [14]:
os.environ["CBIOPORTAL_INSTANCE"] = "www.cbioportal.org"
os.environ["OUTPUT_DIR"] = "."
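
A small helper can make the dependence on environment variables explicit and fail early when one is missing. The function below is a hypothetical sketch, not part of the toolkit's `config` module; only the variable names `CBIOPORTAL_INSTANCE` and `OUTPUT_DIR` come from this example.

```python
import os


def read_required_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Same settings as in the cell above.
os.environ["CBIOPORTAL_INSTANCE"] = "www.cbioportal.org"
os.environ["OUTPUT_DIR"] = "."

instance = read_required_env("CBIOPORTAL_INSTANCE")
output_dir = read_required_env("OUTPUT_DIR")
print(instance, output_dir)
```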

In [15]:
study = "msk_impact_2017"
cbioportal.Mutations(study).get_and_save_cbioportal_cohort_mutations()
df_zehir = pd.read_csv(f"{study}_mutations.csv")
df_zehir.iloc[0]

alleleSpecificCopyNumber                                                NaN
aminoAcidChange                                                         NaN
center                                                                  NaN
chr                                                                       9
driverFilter                                                            NaN
driverFilterAnnotation                                                  NaN
driverTiersFilter                                                       NaN
driverTiersFilterAnnotation                                             NaN
endPosition                                                       133760514
entrezGeneId                                                             25
keyword                                                     ABL1 truncating
molecularProfileId                                msk_impact_2017_mutations
mutationStatus                                                          NaN
mutationType

GenomeNexus, which annotates cBioPortal entries, uses the canonical UniProt sequence. We can therefore confirm that the residue number and amino acid in the `proteinChange` field align with the canonical UniProt sequence obtained from the `uniprot` module. Because protein positions are 1-based and Python strings are 0-indexed, residue 947 is found at index `947-1`.

In [16]:
uniprot.UniProt(df_kinhub["UniprotID"][0])._sequence[947-1]

'K'
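
The same check can be wrapped in a function that parses the reference residue and position out of a protein change string. The sketch below is a hypothetical helper run on a made-up sequence and made-up changes. It assumes the change is written as a one-letter residue followed by a 1-based position (optionally prefixed with `p.`).

```python
import re


def matches_reference(sequence: str, protein_change: str) -> bool:
    """Check that the reference residue of a change like 'K2R' matches the sequence."""
    match = re.match(r"^(?:p\.)?([A-Z])(\d+)", protein_change)
    if match is None:
        raise ValueError(f"Unrecognized protein change: {protein_change}")
    residue, position = match.group(1), int(match.group(2))
    if not 1 <= position <= len(sequence):
        return False
    # Convert the 1-based position to a 0-based string index.
    return sequence[position - 1] == residue


toy_sequence = "MKTAYIAK"  # made-up sequence, not a real protein
print(matches_reference(toy_sequence, "K2R"))
print(matches_reference(toy_sequence, "A2R"))
```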

## Putting it all together

Ultimately, the goal of this package is to allow us to build relational databases that we can query to obtain any information needed for additional downstream analyses for a list of proteins (KinHub kinases, in this example).

In [17]:
from tqdm.notebook import tqdm

### UniProt

In [18]:
# Collect the UniProt ID, HGNC name, and canonical sequence for each kinase.
list_uniprot, list_hgnc, list_sequence = [], [], []

for _, row in tqdm(df_kinhub.iterrows(), total=df_kinhub.shape[0]):
    list_uniprot.append(row["UniprotID"])
    list_hgnc.append(row["HGNC Name"])
    list_sequence.append(uniprot.UniProt(row["UniprotID"])._sequence)

# Assemble the three lists into one dataframe with one row per kinase.
df_uniprot = pd.DataFrame(
    {
        "uniprot_id": list_uniprot,
        "hgnc_name": list_hgnc,
        "canonical_sequence": list_sequence,
    }
)

  0%|          | 0/517 [00:00<?, ?it/s]

### Pfam

In [19]:
df_pfam = pd.DataFrame()
for index, row in tqdm(df_kinhub.iterrows(), total = df_kinhub.shape[0]):
    df_temp = pfam.Pfam(row["UniprotID"])._pfam
    df_pfam = pd.concat([df_pfam, df_temp]).reset_index(drop=True)

df_pfam["uniprot"] = df_pfam["uniprot"].str.upper()

  0%|          | 0/517 [00:00<?, ?it/s]

No PFAM domains found: B5MCJ9


### KLIFS

In [20]:
df_klifs = pd.DataFrame()
for index, row in tqdm(df_kinhub.iterrows(), total = df_kinhub.shape[0]):
    df_temp = pd.DataFrame(klifs.KinaseInfo(row["HGNC Name"])._kinase_info, index=[0])
    df_klifs = pd.concat([df_klifs, df_temp]).reset_index(drop=True)

  0%|          | 0/517 [00:00<?, ?it/s]

Error in query_kinase_info for ADRBK1:
Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead.
Error in query_kinase_info for PRKDC:
Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead.
Error in query_kinase_info for ADRBK2:
Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead.
Error in query_kinase_info for PAK7:
Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead.
Error in query_kinase_info for GSG2:
Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead.
Error in query_
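
These errors indicate that KLIFS does not recognize some of the gene names taken from KinHub. Since the `kinase_information` endpoint also accepts UniProt IDs, one possible workaround is to retry a failed lookup with the UniProt ID. The sketch below uses a stand-in `fake_lookup` function with made-up identifiers so that it runs offline. It does not call KLIFS, the `KeyError` is only a placeholder for however a failed lookup is signaled, and a kinase may still be absent from KLIFS under either identifier.

```python
from typing import Callable, Optional


def lookup_with_fallback(
    lookup: Callable[[str], dict], gene_name: str, uniprot_id: str
) -> Optional[dict]:
    """Try the gene name first, then the UniProt ID; return None if both fail."""
    for identifier in (gene_name, uniprot_id):
        try:
            return lookup(identifier)
        except KeyError:
            continue
    return None


# Stand-in for a database query that knows only these made-up identifiers.
_FAKE_RECORDS = {"KIN1": {"name": "KIN1"}, "X00002": {"name": "KIN2"}}


def fake_lookup(identifier: str) -> dict:
    return _FAKE_RECORDS[identifier]  # raises KeyError for unknown identifiers


print(lookup_with_fallback(fake_lookup, "KIN1", "X00001"))      # found by name
print(lookup_with_fallback(fake_lookup, "OLD_KIN2", "X00002"))  # found by ID
print(lookup_with_fallback(fake_lookup, "KIN3", "X00003"))      # not found
```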

In [21]:
df_klifs.head()

| | family | full_name | gene_name | group | iuphar | kinase_ID | name | pocket | species | subfamily | uniprot |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abl | ABL proto-oncogene 1, non-receptor tyrosine ki... | ABL1 | TK | 1923 | 392 | ABL1 | HKLGGGQYGEVYEVAVKTLEFLKEAAVMKEIKPNLVQLLGVYIITE... | Human | | P00519 |
| 1 | Ack | tyrosine kinase, non-receptor, 2 | TNK2 | TK | 2246 | 394 | ACK | EKLGDGSFGVVRRVAVKCLDFIREVNAMHSLDRNLIRLYGVKMVTE... | Human | | Q07912 |
| 2 | STKR | activin A receptor type IIA | ACVR2A | TKL | 1791 | 523 | ACTR2 | EVKARGRFGCVWKVAVKIFSWQNEYEVYSLPGENILQFIGAWLITA... | Human | Type2 | P27037 |
| 3 | STKR | activin A receptor type IIB | ACVR2B | TKL | 1792 | 524 | ACTR2B | EIKARGRFGCVWKVAVKIFSWQSEREIFSTPGENLLQFIAAWLITA... | Human | Type2 | Q13705 |
| 4 | ABC1 | Atypical kinase COQ8B, mitochondrial | COQ8B | Atypical | 1928 | 71 | ADCK4 | VPFAAASIGQVHQVAVKIQDYRREAACAQNFRFRVPAVVKETRVLG... | Human | ABC1- | Q96D53 |


### Save dataframes

In [23]:
df_uniprot.to_csv("kinhub_uniprot.csv")
df_pfam.to_csv("kinhub_pfam.csv")
df_klifs.to_csv("kinhub_klifs.csv")
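
As a sketch of the relational use case, dataframes like these can be loaded into a SQLite database and joined on the UniProt ID. The example below uses small toy frames and an in-memory database so that it runs on its own. The table names and the query are illustrative choices, not something the toolkit prescribes.

```python
import sqlite3

import pandas as pd

# Toy stand-ins for df_uniprot and df_pfam.
df_uniprot_toy = pd.DataFrame({"uniprot_id": ["P00519"], "hgnc_name": ["ABL1"]})
df_pfam_toy = pd.DataFrame(
    {
        "uniprot": ["P00519", "P00519"],
        "name": ["SH2 domain", "SH3 domain"],
        "start": [127, 67],
        "end": [202, 113],
    }
)

conn = sqlite3.connect(":memory:")
try:
    df_uniprot_toy.to_sql("uniprot", conn, index=False)
    df_pfam_toy.to_sql("pfam", conn, index=False)

    # "start" and "end" are quoted because END is an SQL keyword.
    query = """
        SELECT u.hgnc_name, p.name, p."start", p."end"
        FROM uniprot AS u
        JOIN pfam AS p ON p.uniprot = u.uniprot_id
        ORDER BY p."start"
    """
    df_joined = pd.read_sql_query(query, conn)
finally:
    conn.close()

print(df_joined)
```

Keeping each source in its own table, keyed by UniProt ID, avoids duplicating the sequence for every domain row.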