# Database module

The database module lets users query database APIs to extract clinically relevant mutation data and protein annotations from several sources, including:
+ **KinHub:** curated list of human kinases
+ **UniProt:** obtain canonical protein sequence information
+ **Pfam:** annotate protein domains
+ **HGNC:** standardize gene naming conventions
+ **KLIFS:** kinase-ligand interaction annotations
+ **cBioPortal:** multi-institutional repository of sequencing data for cancer genomics

## Use `requests_cache` to minimize the number of requests

In [1]:
from os import path

from mkt.databases.config import set_request_cache
from mkt.databases.io_utils import get_repo_root

In [2]:
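try:
    # Prefer a cache file at the repository root
    set_request_cache(path.join(get_repo_root(), "requests_cache.sqlite"))
except Exception:
    # Fall back to the current working directory if the root cannot be resolved
    set_request_cache(path.join(".", "requests_cache.sqlite"))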
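
The cell above falls back to the current directory when the repository root cannot be determined. A minimal sketch of that kind of fallback in plain Python follows. `find_repo_root` and `resolve_cache_path` are hypothetical helpers written for illustration, not part of `mkt.databases`, and using a `.git` directory as the root marker is an assumption.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path) -> Optional[Path]:
    """Walk upwards from `start` until a directory containing `.git` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


def resolve_cache_path(start: Path, filename: str = "requests_cache.sqlite") -> Path:
    """Place the cache at the repository root, or in `start` if no root is found."""
    root = find_repo_root(start)
    return (root if root is not None else start.resolve()) / filename
```

Calling `resolve_cache_path(Path.cwd())` from anywhere inside a checkout would then yield one shared cache file rather than one per working directory.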

## KinHub

In this vignette, we will extract a list of human kinases obtained from [KinHub](http://www.kinhub.org/kinases.html) for which we wish to obtain additional protein annotations and extract corresponding mutations from a cBioPortal cohort.

In [3]:
from mkt.databases import scrapers

In [4]:
df_kinhub = scrapers.kinhub()

In [KinHub](http://www.kinhub.org/kinases.html), 13 kinases possess more than one kinase domain and are therefore listed as separate entries despite sharing a single UniProt ID. To remedy this, we have aggregated entries by common HGNC gene name so that each entry represents a unique protein rather than a unique kinase domain.

In [5]:
df_kinhub.loc[df_kinhub["Manning Name"].apply(lambda x: "DOMAIN2" in x.upper()), ]

Unnamed: 0,HGNC Name,xName,Manning Name,Kinase Name,Group,Family,SubFamily,UniprotID
141,JAK1,"JAK1, JAK1_b","JAK1, Domain2_JAK1",Tyrosine-protein kinase JAK1,TK,"JakB, Jak",,P23458
142,JAK2,"JAK2_b, JAK2","Domain2_JAK2, JAK2",Tyrosine-protein kinase JAK2,TK,"JakB, Jak",,O60674
143,JAK3,"JAK3, JAK3_b","JAK3, Domain2_JAK3",Tyrosine-protein kinase JAK3,TK,"JakB, Jak",,P52333
191,RPS6KA5,"MSK1, MSK1_b","MSK1, Domain2_MSK1",Ribosomal protein S6 kinase alpha-5,"AGC, CAMK","RSKb, RSK","MSK, MSKb",O75582
192,RPS6KA4,"MSK2_b, MSK2","Domain2_MSK2, MSK2",Ribosomal protein S6 kinase alpha-4,"AGC, CAMK","RSKb, RSK","MSK, MSKb",O75676
268,RPS6KA2,"RSK3_b, RSK3","RSK1, Domain2_RSK1",Ribosomal protein S6 kinase alpha-2,"AGC, CAMK","RSKb, RSK","RSKp90, RSKb",Q15349
269,RPS6KA3,"RSK2_b, RSK2","RSK2, Domain2_RSK2",Ribosomal protein S6 kinase alpha-3,"AGC, CAMK","RSKb, RSK","RSKp90, RSKb",P51812
270,RPS6KA1,"RSK1, RSK1_b","Domain2_RSK3, RSK3",Ribosomal protein S6 kinase alpha-1,"AGC, CAMK","RSKb, RSK","RSKp90, RSKb",Q15418
305,TYK2,"TYK2_b, TYK2","Domain2_TYK2, TYK2",Non-receptor tyrosine-protein kinase TYK2,TK,"JakB, Jak",,P29597
395,EIF2AK4,"GCN2_b, GCN2","Domain2_GCN2, GCN2",Eukaryotic translation initiation factor 2-alp...,"Other, STE","STE-Unique, PEK","nan, GCN2",Q9P2K8
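
The aggregation by HGNC gene name can be illustrated without running the scraper. The sketch below collapses per-domain rows that share a gene name and joins their distinct values with `", "`, which reproduces the shape of the JAK1 row in the output. The `aggregate_by_gene` helper and the two per-domain input rows are hypothetical reconstructions, not the package's implementation.

```python
from collections import defaultdict


def aggregate_by_gene(rows, key="HGNC Name"):
    """Collapse rows sharing `key`, joining distinct values with ', '."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        columns = grouped[row[key]]
        for column, value in row.items():
            # Keep each distinct value once, in first-seen order
            if column != key and value not in columns[column]:
                columns[column].append(value)
    return [
        {key: gene, **{col: ", ".join(vals) for col, vals in columns.items()}}
        for gene, columns in grouped.items()
    ]


# Hypothetical per-domain rows for JAK1, as they might look before aggregation
rows = [
    {"HGNC Name": "JAK1", "xName": "JAK1", "Manning Name": "JAK1", "UniprotID": "P23458"},
    {"HGNC Name": "JAK1", "xName": "JAK1_b", "Manning Name": "Domain2_JAK1", "UniprotID": "P23458"},
]
aggregated = aggregate_by_gene(rows)
```

Because duplicates are dropped before joining, the shared UniProt ID stays a single value while the domain-specific names are concatenated.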


## UniProt

In [6]:
from mkt.databases import uniprot

In [7]:
uniprot.UniProtFASTA(df_kinhub["UniprotID"][0])._sequence

'MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSENDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVNSLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTASDGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERTDITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQLLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFIHRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKSDVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNPSDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAEHRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLFSALIKKKKKTAPTPPKRSSSFREMDGQPERRGAGEEEGRDISNGALAFTPLDTADPAKSPKPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSASCVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTVTPPPRLVKKNEEAADEVFKDIMESSPGSSPPNLTPKPLRRQVTVAPASGLPHKEEAGKGSALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPPPPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVLPATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPSS

## Pfam

In [8]:
from mkt.databases import pfam

In [9]:
pfam.Pfam(df_kinhub["UniprotID"][0])._pfam

Unnamed: 0,uniprot,protein_length,source_database,organism,in_alphafold,pfam_accession,name,source_database.1,type,integrated,member_databases,go_terms,start,end,dc-status,representative,model,score
0,p00519,1130,reviewed,9606,True,PF00017,SH2 domain,pfam,domain,IPR000980,,,127,202,CONTINUOUS,False,PF00017,9.3e-21
1,p00519,1130,reviewed,9606,True,PF00018,SH3 domain,pfam,domain,IPR001452,,,67,113,CONTINUOUS,False,PF00018,2e-09
2,p00519,1130,reviewed,9606,True,PF07714,Protein tyrosine and serine/threonine kinase,pfam,domain,IPR001245,,,242,492,CONTINUOUS,False,PF07714,1.4e-97
3,p00519,1130,reviewed,9606,True,PF08919,F-actin binding,pfam,domain,IPR015015,,,1026,1130,CONTINUOUS,False,PF08919,3.5000000000000004e-28


## HGNC

In this example, KinHub provides UniProt IDs for each entry. UniProt IDs are needed to query the UniProt API. However, if you need to retrieve UniProt IDs from HGNC gene names, Ensembl IDs, or other identifiers, the HGNC module can be used to interrogate HGNC's [Genename API](https://www.genenames.org/help/rest/).

In [10]:
from mkt.databases import hgnc

In [11]:
hgnc.HGNC("Akt1").maybe_get_info_from_hgnc_fetch(["uniprot_ids"])["uniprot_ids"][0][0]

'P31749'
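
For readers curious about what such a lookup involves, the sketch below queries the HGNC REST service directly with the standard library. It assumes the `fetch/symbol` endpoint and the `response.docs[].uniprot_ids` JSON layout of the public HGNC REST service. The function names are hypothetical and this is not how the `hgnc` module is necessarily implemented.

```python
import json
from urllib.request import Request, urlopen

HGNC_REST = "https://rest.genenames.org"


def build_hgnc_request(symbol: str) -> Request:
    """Build a request for the HGNC `fetch/symbol` endpoint, asking for JSON."""
    return Request(
        f"{HGNC_REST}/fetch/symbol/{symbol}",
        headers={"Accept": "application/json"},
    )


def extract_uniprot_ids(payload: dict) -> list:
    """Collect the `uniprot_ids` of every record in a fetch response."""
    docs = payload.get("response", {}).get("docs", [])
    return [accession for doc in docs for accession in doc.get("uniprot_ids", [])]


def fetch_uniprot_ids(symbol: str) -> list:
    """Query HGNC over the network and return the matching UniProt IDs."""
    with urlopen(build_hgnc_request(symbol), timeout=30) as response:
        return extract_uniprot_ids(json.load(response))


# Hand-written stand-in for the response shape, so the parser can be tried offline
mock_payload = {"response": {"docs": [{"symbol": "AKT1", "uniprot_ids": ["P31749"]}]}}
mock_ids = extract_uniprot_ids(mock_payload)
```

Separating request construction from response parsing keeps the parsing step testable without network access.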

## KLIFS

The KLIFS database's `kinase_information` endpoint takes a `kinase_name` input, which can be either an HGNC gene name or a UniProt ID. Both forms return the same record, as shown below for AKT1.

In [12]:
from mkt.databases import klifs

In [13]:
klifs.KinaseInfo(search_term="Akt1").get_kinase_info()

[{'kinase_ID': 1,
  'name': 'AKT1',
  'gene_name': 'AKT1',
  'family': 'Akt',
  'group': 'AGC',
  'subfamily': '',
  'species': 'Human',
  'full_name': 'v-akt murine thymoma viral oncogene homolog 1',
  'uniprot': 'P31749',
  'iuphar': 1479,
  'pocket': 'KLLGKGTFGKVILYAMKILHTLTENRVLQNSRPFLTALKYSCFVMEYANGGELFFHLSRLHSEKNVVYRDLKLENLMLITDFGLC'}]

In [14]:
klifs.KinaseInfo(search_term="P31749").get_kinase_info()

[{'kinase_ID': 1,
  'name': 'AKT1',
  'gene_name': 'AKT1',
  'family': 'Akt',
  'group': 'AGC',
  'subfamily': '',
  'species': 'Human',
  'full_name': 'v-akt murine thymoma viral oncogene homolog 1',
  'uniprot': 'P31749',
  'iuphar': 1479,
  'pocket': 'KLLGKGTFGKVILYAMKILHTLTENRVLQNSRPFLTALKYSCFVMEYANGGELFFHLSRLHSEKNVVYRDLKLENLMLITDFGLC'}]
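
Since the same endpoint accepts both gene names and UniProt IDs, it can be useful to tell which kind of identifier a search term is before querying. The sketch below uses the accession pattern published in the UniProt documentation. The helper name `looks_like_uniprot_accession` is hypothetical, and the check is only a heuristic on the identifier's format.

```python
import re

# Accession number format documented by UniProt (6 or 10 characters)
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_uniprot_accession(term: str) -> bool:
    """Return True if `term` has the format of a UniProt accession number."""
    return UNIPROT_ACCESSION.fullmatch(term.strip().upper()) is not None
```

With this check, `"P31749"` is classified as an accession while `"Akt1"` is treated as a gene name.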

## cBioPortal

In [15]:
import os
import pandas as pd
from mkt.databases import cbioportal

This module takes its inputs from environment variables. See the `config` module documentation for additional details. In this example, we will query the publicly available cBioPortal instance and use the [Zehir, 2017](https://www.nature.com/articles/nm.4333) MSK-IMPACT sequencing cohort as the study of interest.

In [16]:
os.environ["CBIOPORTAL_INSTANCE"] = "www.cbioportal.org"
os.environ["OUTPUT_DIR"] = "."

In [17]:
study = "msk_impact_2017"
df_zehir = cbioportal.Mutations(study).get_cbioportal_cohort_mutations()
df_zehir.iloc[0]

No API token provided


alleleSpecificCopyNumber                                               None
aminoAcidChange                                                        None
center                                                                   NA
chr                                                                      14
driverFilter                                                           None
driverFilterAnnotation                                                 None
driverTiersFilter                                                      None
driverTiersFilterAnnotation                                            None
endPosition                                                       105246551
entrezGeneId                                                            207
keyword                                                   AKT1 E17 missense
molecularProfileId                                msk_impact_2017_mutations
mutationStatus                                                           NA
mutationType

GenomeNexus, which is used to annotate cBioPortal entries, uses the canonical UniProt sequence. As such, we can confirm that the `proteinChange` field numbering and the corresponding amino acid align with the canonical UniProt sequence obtained from the `uniprot` module. Sequence positions are 1-indexed, so residue 17 sits at Python index `17-1`.

In [18]:
uniprot.UniProtFASTA("P31749")._sequence[17-1]

'E'
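
The same check can be generalized to any substitution written as reference residue, position, and variant (for example `E17K`). The sketch below is a hypothetical helper, not part of the package. It only reads the leading residue and position, and it is demonstrated on a synthetic toy sequence rather than the real AKT1 sequence.

```python
import re

# Leading reference residue and 1-based position, e.g. "E17" in "E17K"
PROTEIN_CHANGE = re.compile(r"^([A-Z])(\d+)")


def reference_residue_matches(sequence: str, protein_change: str) -> bool:
    """Check that the reference residue of a change agrees with the sequence."""
    match = PROTEIN_CHANGE.match(protein_change)
    if match is None:
        raise ValueError(f"Unrecognized protein change: {protein_change!r}")
    residue, position = match.group(1), int(match.group(2))
    if position < 1 or position > len(sequence):
        return False
    # Convert the 1-based protein position to a 0-based string index
    return sequence[position - 1] == residue


# Synthetic 20-residue sequence with glutamate (E) at position 17
toy_sequence = "M" + "A" * 15 + "E" + "G" * 3
toy_result = reference_residue_matches(toy_sequence, "E17K")
```

Applied across a cohort, a check like this would flag entries whose numbering does not follow the canonical sequence.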

## Putting it all together

Ultimately, the goal of this package is to allow us to build relational databases that we can query to obtain any information needed for additional downstream analyses for a list of proteins (KinHub kinases, in this example).
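
To make the relational idea concrete, here is a minimal sketch using an in-memory SQLite database from the standard library. The table and column names are assumptions chosen for illustration. The domain coordinates are taken from the Pfam output shown earlier for P00519 (ABL1), and AKT1 is included only as a second protein row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE uniprot (
        uniprot_id TEXT PRIMARY KEY,
        hgnc_name  TEXT NOT NULL
    );
    CREATE TABLE pfam (
        uniprot_id   TEXT REFERENCES uniprot (uniprot_id),
        name         TEXT,
        domain_start INTEGER,
        domain_end   INTEGER
    );
    """
)
conn.executemany(
    "INSERT INTO uniprot VALUES (?, ?)",
    [("P00519", "ABL1"), ("P31749", "AKT1")],
)
conn.executemany(
    "INSERT INTO pfam VALUES (?, ?, ?, ?)",
    [
        ("P00519", "SH2 domain", 127, 202),
        ("P00519", "SH3 domain", 67, 113),
        ("P00519", "Protein tyrosine and serine/threonine kinase", 242, 492),
        ("P00519", "F-actin binding", 1026, 1130),
    ],
)

# Join on the shared UniProt ID to list one gene's domains in sequence order
domains = conn.execute(
    """
    SELECT p.name, p.domain_start, p.domain_end
    FROM uniprot AS u
    JOIN pfam AS p ON p.uniprot_id = u.uniprot_id
    WHERE u.hgnc_name = ?
    ORDER BY p.domain_start
    """,
    ("ABL1",),
).fetchall()
```

Keying every table on the UniProt ID is what lets annotations from independent sources be joined per protein.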

In [19]:
from tqdm.notebook import tqdm

### UniProt

In [20]:
list_uniprot, list_hgnc, list_sequence = [], [], []

# Fetch the canonical sequence for every KinHub entry
for _, row in tqdm(df_kinhub.iterrows(), total=df_kinhub.shape[0]):
    list_uniprot.append(row["UniprotID"])
    list_hgnc.append(row["HGNC Name"])
    list_sequence.append(uniprot.UniProtFASTA(row["UniprotID"])._sequence)

dict_uniprot = {
    "uniprot_id": list_uniprot,
    "hgnc_name": list_hgnc,
    "canonical_sequence": list_sequence,
}

df_uniprot = pd.DataFrame.from_dict(dict_uniprot)

  0%|          | 0/517 [00:00<?, ?it/s]

### Pfam

In [21]:
df_pfam = pd.DataFrame()
for _, row in tqdm(df_kinhub.iterrows(), total=df_kinhub.shape[0]):
    df_temp = pfam.Pfam(row["UniprotID"])._pfam
    df_pfam = pd.concat([df_pfam, df_temp]).reset_index(drop=True)

# Pfam reports accessions in lowercase; normalize them to match the other tables
df_pfam["uniprot"] = df_pfam["uniprot"].str.upper()

  0%|          | 0/517 [00:00<?, ?it/s]

No PFAM domains found: B5MCJ9


### KLIFS

In [22]:
df_klifs = pd.DataFrame()
for _, row in tqdm(df_kinhub.iterrows(), total=df_kinhub.shape[0]):
    temp = klifs.KinaseInfo(row["UniprotID"], "uniprot").get_kinase_info()
    if temp is not None:
        df_temp = pd.DataFrame(temp[0], index=[0])
        df_klifs = pd.concat([df_klifs, df_temp]).reset_index(drop=True)

  0%|          | 0/517 [00:00<?, ?it/s]

Error Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead. in query_kinase_info for P78527
Error Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead. in query_kinase_info for Q12979
Error Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead. in query_kinase_info for B5MCJ9
Error Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead. in query_kinase_info for Q9Y5P4
Error Expected type to be dict for value [400, 'KLIFS error: An unknown kinase name was provided'] to unmarshal to a <class 'abc.Error'>.Was <class 'list'> instead. in query_kinase_info for P53004
Error Expected 
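
As the log output shows, not every KinHub entry can be found in every database. One way to keep track of such gaps is to record failed lookups alongside the successful ones. The sketch below assumes a lookup callable that returns `None` or raises when nothing is found. `collect_records` and `fake_lookup` are hypothetical stand-ins and do not call KLIFS.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def collect_records(
    ids: Iterable[str],
    lookup: Callable[[str], Optional[dict]],
) -> Tuple[Dict[str, dict], List[str]]:
    """Run `lookup` on every ID, separating hits from IDs with no record."""
    found: Dict[str, dict] = {}
    missing: List[str] = []
    for identifier in ids:
        try:
            record = lookup(identifier)
        except Exception:  # sketch only: real code should catch narrower errors
            record = None
        if record is None:
            missing.append(identifier)
        else:
            found[identifier] = record
    return found, missing


# Stand-in for a database query: it only knows about AKT1
def fake_lookup(identifier: str) -> Optional[dict]:
    known = {"P31749": {"name": "AKT1", "group": "AGC"}}
    return known.get(identifier)


found, missing = collect_records(["P31749", "B5MCJ9"], fake_lookup)
```

Keeping the list of missing IDs makes it straightforward to report which proteins lack an annotation from a given source.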

### Generating a serializable model using `Pydantic`

For more details on the contents of the resulting `KinaseInfo` objects, see this [notebook](./schema_demo.ipynb).

In [24]:
from mkt.databases import kinase_schema

df_merge = kinase_schema.concatenate_source_dataframe()
dict_kin = kinase_schema.create_kinase_models_from_df(df_merge)